General description of the dataset

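The dataset contains information on H1B applications filed by companies in the USA that are willing to hire foreign workers. There are 1,870,355 observations and 7 variables, which appear as the column headers in the snapshot below (submit.year, the eighth column, appears to be derived from submit.date).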

The information in the dataset spans 2012 to 2016.


Here is a snapshot of the first 10 rows of the dataset.

| employer | job.title | base.salary | location | submit.date | start.date | case.status | submit.year |
|---|---|---|---|---|---|---|---|
| RAVELLO SOLUTIONS LLC | .NET SOFTWARE ENGINEER | 65000 | ATLANTA, GA | 2012-11-06 | 2012-11-19 | CERTIFIED | 2012 |
| INFO RETAIL INC | ACCOUNT PLANNER/ACCOUNT MANAGER | 116000 | ATLANTA, GA | 2012-11-15 | 2012-11-26 | CERTIFIED | 2012 |
| IBM INDIA PRIVATE LIMITED | ADVISORY CONSULTANT | 70288 | ATLANTA, GA | 2012-12-31 | 2013-03-01 | CERTIFIED | 2012 |
| ERNST & YOUNG US LLP | ADVISORY EXECUTIVE | 260000 | ATLANTA, GA | 2012-11-23 | 2012-11-26 | CERTIFIED | 2012 |
| ERNST & YOUNG US LLP | ADVISORY EXECUTIVE | 260000 | ATLANTA, GA | 2012-12-04 | 2012-12-04 | CERTIFIED | 2012 |
| ERNST & YOUNG US LLP | ADVISORY MANAGER | 115000 | ATLANTA, GA | 2012-11-28 | 2013-01-02 | CERTIFIED | 2012 |
| ERNST & YOUNG US LLP | ADVISORY MANAGER | 141000 | ATLANTA, GA | 2012-12-13 | 2013-04-30 | CERTIFIED | 2012 |
| ERNST & YOUNG US LLP | ADVISORY SENIOR | 90000 | ATLANTA, GA | 2012-11-30 | 2012-12-03 | CERTIFIED | 2012 |
| ERNST & YOUNG US LLP | ADVISORY SENIOR | 110000 | ATLANTA, GA | 2012-09-27 | 2012-11-20 | CERTIFIED | 2012 |
| ERNST & YOUNG US LLP | ADVISORY SENIOR MANAGER | 180000 | ATLANTA, GA | 2012-11-30 | 2012-12-03 | CERTIFIED | 2012 |

Employer

There are 123,304 distinct employers in the dataset.

Job title

There are 191,207 distinct job titles in the dataset.

Base salary

Oddly, the lowest base salary in the dataset is $0. In 13 instances, a company applied for an H1B visa for a position that would pay the foreign worker nothing (the equivalent of volunteer work or an unpaid internship). The 13 cases come from 12 distinct named companies, plus one case in which the employer name is missing. The positions vary (e.g. architect, computer programmer, civil engineer) and, as expected, all of the cases were either withdrawn or denied.

| employer | job.title | base.salary | location | submit.date | start.date | case.status | submit.year |
|---|---|---|---|---|---|---|---|
| PHYSICIAN MULTISERVICES LTD | ACCOUNTING, MANAGERIAL | 0 | BROOKLYN, NY | 2013-04-04 | 2013-10-01 | DENIED | 2013 |
| AAI ARCHITECTS PC | ARCHITECT | 0 | NEW YORK CITY, NY | 2013-03-04 | 2013-03-14 | DENIED | 2013 |
| GLOBAL TPA LLC | INFORMATION SYSTEMS ANALYST | 0 | TAMPA, FL | 2013-06-17 | 2013-07-16 | WITHDRAWN | 2013 |
| CML MEDIA CORP | SOFTWARE DEVELOPER | 0 | COSTA MESA, CA | 2014-05-14 | 2014-10-01 | DENIED | 2014 |
| PARKER MESSANA & ASSOCIATES INC | CIVIL ENGINEER | 0 | FEDERAL WAY, WA | 2014-07-15 | 2014-09-01 | DENIED | 2014 |
| SOUTHWESTERN CONSOLIDATED DIRECTORY CO INC | COMPUTER PROGRAMMERS | 0 | LIVINGSTON, TX | 2014-10-06 | 2014-10-14 | WITHDRAWN | 2014 |
| DELL MARKETING LP | NETWORK SUPPORT ADVISOR | 0 | PLANO, TX | 2014-02-10 | 2014-08-06 | WITHDRAWN | 2014 |
|  |  | 0 | SAN FRANCISCO, CA | 2014-05-09 | 2013-05-06 | WITHDRAWN | 2014 |
| STOBI LLC | COMMISSIONING ENGINEER | 0 | TAMPA, FL | 2014-12-09 | 2015-01-20 | DENIED | 2014 |
| TECHNOPLUS INC |  | 0 | COPPELL, TX | 2016-03-03 | 2016-11-04 | WITHDRAWN | 2016 |
| TECHMATRIX INC | COMPUTER PROGRAMMER | 0 | NORTH BRUNSWICK, NJ | 2016-02-13 | 2016-11-04 | WITHDRAWN | 2016 |
| SUPREME COURT OF THE STATE OF NEW YORK | LAW LIBRARY RESEARCHER | 0 | NEW YORK CITY, NY | 2016-02-23 | 2015-12-15 | DENIED | 2016 |
| IBM CORPORATION | STAFF SOFTWARE ENGINEER | 0 | WAYNE, PA | 2016-03-28 | 2016-09-09 | WITHDRAWN | 2016 |
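As a rough illustration of how the lowest-salary rows can be isolated and their outcomes tallied, here is a minimal sketch in Python (the outputs in this document come from R, so this is not the code that produced them). The five records are copied from the tables in this document; the function name lowest_salary_rows is made up for the example.

```python
from collections import Counter

# A few rows copied from the tables in this document:
# (employer, base_salary, case_status). This is not the full dataset.
rows = [
    ("RAVELLO SOLUTIONS LLC", 65000, "CERTIFIED"),
    ("AAI ARCHITECTS PC", 0, "DENIED"),
    ("GLOBAL TPA LLC", 0, "WITHDRAWN"),
    ("IBM CORPORATION", 0, "WITHDRAWN"),
    ("ERNST & YOUNG US LLP", 260000, "CERTIFIED"),
]


def lowest_salary_rows(records):
    """Return the minimum base salary and every record that has it."""
    lowest = min(salary for _, salary, _ in records)
    return lowest, [r for r in records if r[1] == lowest]


lowest, matches = lowest_salary_rows(rows)

# Tally the case statuses among the lowest-salary records.
status_counts = Counter(status for _, _, status in matches)
print(lowest, dict(status_counts))
```

On the full dataset the same two steps (find the minimum, then count statuses among the matching rows) would reproduce the 13 withdrawn or denied cases listed above.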


The second-lowest base salary is $300. The positions at this salary are also varied, but several employers appear more than once. The amount is so low that at least some of these positions are probably internships; indeed, one of the job titles is “STOCKBROKER TRAINEE”. Surprisingly, only 4 of the 13 applications at this salary were denied.

| employer | job.title | base.salary | location | submit.date | start.date | case.status | submit.year |
|---|---|---|---|---|---|---|---|
| SILENT MODELS USA LLC | FASHION MODEL | 300 | NEW YORK, NY | 2012-12-11 | 2013-01-02 | CERTIFIED | 2012 |
| LDRK MEDIA LLC | VIDEO EDITING AND COMMUNITY OUTREACH | 300 | GEORGETOWN, TX | 2013-05-08 | 2013-06-01 | DENIED | 2013 |
| SPARTAN CAPITAL SECURITIES LLC | STOCKBROKER TRAINEE | 300 | GARDEN CITY, NY | 2013-01-25 | 2013-03-01 | DENIED | 2013 |
| SILENT MODELS USA LLC | FASHION MODEL | 300 | NEW YORK, NY | 2013-03-07 | 2013-09-06 | CERTIFIED | 2013 |
| FAIRWAY COUNSELING INC | PSYCHIATRIC PHYSICIAN | 300 | ALEXANDRIA, LA | 2014-02-14 | 2014-03-01 | CERTIFIED | 2014 |
| RUFUS-ISAACS ACLAND & GRANTHAM LLP | LITIGATION ATTORNEY | 300 | BEVERLY HILLS, CA | 2014-01-22 | 2014-02-10 | CERTIFIED | 2014 |
| RUFUS-ISAACS ACLAND & GRANTHAM LLP | LITIGATION ATTORNEY | 300 | BEVERLY HILLS, CA | 2014-01-22 | 2014-02-10 | CERTIFIED | 2014 |
| HATTIESBURG CLINIC PA | VASCULAR SURGEON & SPECIALTY MEDICINE CONSULTANT | 300 | HATTIESBURG, MS | 2014-05-27 | 2014-06-09 | CERTIFIED | 2014 |
| HATTIESBURG CLINIC P A | VASCULAR SURGEON & SPECIALITY MEDICINE CONSULTANT | 300 | HATTIESBURG, MS | 2015-02-11 | 2015-03-01 | CERTIFIED | 2015 |
| SEASIDE HCBS LLC | PSYCHIATRIST | 300 | ALEXANDRIA, LA | 2016-07-11 | 2016-07-11 | DENIED | 2016 |
| SEASIDE HCBS LLC | PSYCHIATRIST | 300 | ALEXANDRIA, LA | 2016-07-15 | 2016-07-25 | CERTIFIED | 2016 |
| SEASIDE HCBS LLC | PSYCHIATRIST | 300 | KENNER, LA | 2016-07-11 | 2016-07-11 | DENIED | 2016 |
| SEASIDE HCBS LLC | PSYCHIATRIST | 300 | KENNER, LA | 2016-07-15 | 2016-07-25 | CERTIFIED | 2016 |


The highest base salary in the dataset is $7,278,872,788, and the employer for the position is IBM. The application, for a Senior Software Engineer position, was eventually withdrawn. This is very likely a data-entry error, since 7 billion dollars is far too high to be a yearly income.

| employer | job.title | base.salary | location | submit.date | start.date | case.status | submit.year |
|---|---|---|---|---|---|---|---|
| IBM CORPORATION | SENIOR SOFTWARE ENGINEER | 7278872788 | PHILADELPHIA, PA | 2014-09-15 | 2015-03-15 | WITHDRAWN | 2014 |
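One way to probe the typo hypothesis is to check whether the digits of the value are a shorter number written twice in a row. The sketch below does that in Python; the helper name repeated_half is hypothetical, and a match only hints at a duplicated entry, it does not prove what the intended salary was.

```python
def repeated_half(value):
    """Return the repeated half if the digits are one number written twice, else None."""
    digits = str(value)
    half, remainder = divmod(len(digits), 2)
    # Only an even-length digit string can be two identical halves.
    if remainder == 0 and half > 0 and digits[:half] == digits[half:]:
        return int(digits[:half])
    return None


print(repeated_half(7278872788))  # 72788
print(repeated_half(260000))      # None
```

For the IBM record the digits are 72788 followed by 72788 again, so a doubled entry of $72,788 is a plausible explanation for the outlier.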


Density functions for the base salary

It would be ideal to show the density functions of the salaries for each year; however, the salary values are widely dispersed because of outliers, and no discernible information can be derived from the resulting PDFs.

These density functions are not useful because the salaries cover an enormous range (from $0 to over 7 billion dollars) and because the top salaries are far higher than most salaries. In fact, only about 0.04% of the dataset has a base salary above $250k. Using $250k as a cut-off point therefore barely affects the PDFs and makes them more useful for understanding the salary distribution.
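The cut-off check can be sketched as follows, in Python rather than the R used for this report. The function names share_above and trim are made up for the example, and the sample is just the ten base salaries from the snapshot at the top of this document, so its share is nowhere near the 0.04% measured on the full dataset.

```python
def share_above(salaries, cutoff):
    """Fraction of salaries strictly above the cutoff."""
    return sum(1 for s in salaries if s > cutoff) / len(salaries)


def trim(salaries, cutoff):
    """Keep only the salaries at or below the cutoff."""
    return [s for s in salaries if s <= cutoff]


# The ten base salaries from the snapshot of the first 10 rows.
sample = [65000, 116000, 70288, 260000, 260000,
          115000, 141000, 90000, 110000, 180000]

print(share_above(sample, 250_000))  # 0.2 on this tiny, unrepresentative sample
print(max(trim(sample, 250_000)))    # 180000
```

The idea is to measure how much data the cut-off discards before plotting: if the share is tiny, the trimmed densities describe almost all of the applications.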


Case statuses

Bar plots of the frequency of each case status by year show that most H1B applications are certified (i.e. accepted). Denied and withdrawn cases are far fewer than certified cases.
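The counts behind such bar plots are a simple two-level tally: case status within submit year. Here is a minimal Python sketch, assuming the records are available as (year, status) pairs; the function name status_by_year is hypothetical, and the thirteen pairs are the $300 applications listed earlier, not the full dataset.

```python
from collections import Counter, defaultdict


def status_by_year(records):
    """Count case statuses separately for each submit year."""
    counts = defaultdict(Counter)
    for year, status in records:
        counts[year][status] += 1
    return counts


# (submit.year, case.status) for the thirteen $300 applications.
records = [
    (2012, "CERTIFIED"),
    (2013, "DENIED"), (2013, "DENIED"), (2013, "CERTIFIED"),
    (2014, "CERTIFIED"), (2014, "CERTIFIED"),
    (2014, "CERTIFIED"), (2014, "CERTIFIED"),
    (2015, "CERTIFIED"),
    (2016, "DENIED"), (2016, "CERTIFIED"),
    (2016, "DENIED"), (2016, "CERTIFIED"),
]

counts = status_by_year(records)
for year in sorted(counts):
    print(year, dict(counts[year]))
```

Each inner counter corresponds to one group of bars in a per-year plot.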


The submit date/start date spread

The processing time for H1B petitions is generally between 2 and 6 months. Companies that needed to hire workers in a hurry could pay a fee for “premium processing”, which decreased the processing time to 15 calendar days. However, the “premium processing” option was recently suspended.

Calculating the difference, in days, between the appointment’s start date and the date of application (let’s call it the date spread) can provide insight into how early or late companies file for an H1B for their potential foreign employees.
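In code, the date spread is a plain date subtraction. The Python sketch below assumes the sign convention start date minus submit date, so a positive value means the company filed ahead of the start date; the function name date_spread is made up for the example, and both pairs of dates come from tables in this document.

```python
from datetime import date


def date_spread(submit_date, start_date):
    """Days from submission to the appointment's start (start minus submit)."""
    return (start_date - submit_date).days


# First row of the snapshot: submitted 2012-11-06, start 2012-11-19.
print(date_spread(date(2012, 11, 6), date(2012, 11, 19)))  # 13

# A $0 application: submitted 2014-05-09 for a start date of 2013-05-06.
print(date_spread(date(2014, 5, 9), date(2013, 5, 6)))     # -368
```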

Surprisingly, the dataset has many instances of “negative” date spreads! This means that many companies applied for an H1B visa after the appointment’s start date. The most notable case is from 2016: a company in Texas applied for an H1B visa 2338 days after the appointment’s start date (the start date was in September of 2009 and the company submitted the application in February of 2016); the petition was denied. There are 662 cases of negative date spreads, and all but 2 of them were either denied or withdrawn. Here is the case status distribution:

##   case.status count
## 1   CERTIFIED     2
## 2      DENIED   532
## 3   WITHDRAWN   128
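To put those counts in proportion, the short Python sketch below converts them to percentages. The numbers are the ones printed in the table above; nothing else is assumed.

```python
# Case status counts for the applications with a negative date spread.
counts = {"CERTIFIED": 2, "DENIED": 532, "WITHDRAWN": 128}

total = sum(counts.values())

# Percentage of each status, rounded to one decimal place.
shares = {status: round(100 * n / total, 1) for status, n in counts.items()}

print(total)   # 662
print(shares)  # {'CERTIFIED': 0.3, 'DENIED': 80.4, 'WITHDRAWN': 19.3}
```

Roughly four out of five late filings were denied, and well under 1% were certified.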

Surprisingly, 2 cases were certified:

## # A tibble: 2 × 9
##                       employer                job.title base.salary
##                          <chr>                    <chr>       <dbl>
## 1 TECH MAHINDRA (AMERICAS) INC       PROGRAMMER ANALYST       75028
## 2       IGATE TECHNOLOGIES INC COMPUTER SYSTEMS ANALYST       76600
## # ... with 6 more variables: location <chr>, submit.date <date>,
## #   start.date <date>, case.status <chr>, submit.year <dbl>,
## #   date.spread <time>

At the other end of the spectrum, some companies submitted H1B applications years before the appointment’s start date. The most prominent case is a Tae Kwon Do (Korean martial arts) academy in Chandler, AZ, which submitted an application in October of 2012 for an appointment starting in November of 2016 (4 years in advance). On a side note, I would argue that the legendary self-discipline of martial artists can be seen in every area of their lives… including administrative tasks (lol). Unfortunately, this petition was denied!

## # A tibble: 1 × 9
##                    employer                                   job.title
##                       <chr>                                       <chr>
## 1 KOREA TAE KWON DO ACADEMY TAEKWONDO (MARTIAL ARTS) GRANDMSTER (OWNER)
## # ... with 7 more variables: base.salary <dbl>, location <chr>,
## #   submit.date <date>, start.date <date>, case.status <chr>,
## #   submit.year <dbl>, date.spread <time>

The date spread distributions by year follow; their odd-looking shape is due to the many outliers in the dataset.

The density functions are not significantly affected when they are restricted to the observations with a date spread between -50 and 200 days.

The PDFs have two peaks. The first peak (closer to 0) represents applications with premium processing, which was still available between 2012 and 2016. The second peak occurs around 175 days before the appointment’s start date, or about 6 months, which is the upper limit of the usual H1B processing time.
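A coarse histogram is a quick way to look for such peaks without a density plot. The Python sketch below keeps only the spreads inside the -50 to 200 day window and counts them in 25-day bins; the bin width and the function name histogram are arbitrary choices for the example, and the ten spreads are computed from the snapshot rows at the top of this document (start date minus submit date), so they are far too few to show the two peaks themselves.

```python
from collections import Counter


def histogram(spreads, low=-50, high=200, width=25):
    """Count spreads per bin of `width` days, keeping only low <= spread <= high."""
    kept = [s for s in spreads if low <= s <= high]
    # Floor division maps each spread to the lower edge of its bin.
    return Counter((s // width) * width for s in kept)


# Date spreads, in days, for the ten snapshot rows.
snapshot_spreads = [13, 11, 60, 3, 0, 35, 138, 3, 54, 3]

bins = histogram(snapshot_spreads)
for edge in sorted(bins):
    print(f"{edge:>4} to {edge + 24:>4} days: {bins[edge]}")
```

On the full dataset, the bins near 0 and near 175 days would be the ones to compare against the two peaks described above.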

Further analysis

More analysis can (and should) be conducted on the data, including location and job types. The analysis of job types, in particular, will require closer study of the job titles, because the raw dataset contains 191,207 distinct ones; however, they could be grouped by field. For example, an instructor and a lecturer would both fall in the “education” group.
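One simple way to build such groups is keyword matching on the job title. The Python sketch below is only a starting point under stated assumptions: the keyword map is hypothetical (only the “education” example comes from the text above, the “software” entry is added for illustration), and real titles would need a much richer map and probably some manual review.

```python
# Hypothetical keyword-to-field map; only "education" comes from the text.
FIELD_KEYWORDS = {
    "education": ("INSTRUCTOR", "LECTURER"),
    "software": ("SOFTWARE", "PROGRAMMER"),
}


def job_group(title, keywords=FIELD_KEYWORDS, default="other"):
    """Return the first field whose keywords appear in the title, else `default`."""
    upper = title.upper()
    for field, words in keywords.items():
        if any(word in upper for word in words):
            return field
    return default


print(job_group("Lecturer"))                 # education
print(job_group("STAFF SOFTWARE ENGINEER"))  # software
print(job_group("FASHION MODEL"))            # other
```

Substring matching also catches plural and compound titles such as “COMPUTER PROGRAMMERS”, at the cost of occasional false matches.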

My goal, with this document, was to give you a general sense of the dataset.