STOR 320 Final Project

Overview and Motivation:

According to the Bureau of Labor Statistics, roughly 17% of the total United States workforce is comprised of foreign born persons. Obtaining permission to work in the United States as a foreigner is a tedious process, requiring a government visa. Each year, about 350,000 people apply for an H-1B visa, a distinct type of visa that allows U.S. employers to hire foreign workers in special occupations. Regulated by United States Citizenship and Immigration Services, special occupations are defined as “requiring theoretical and practical application of a body of highly specialized knowledge in a field of human endeavor.” Professional fields that typically fall under special occupations include medicine, engineering, and mathematics among other sectors. Furthermore, H-1B candidates are required to have an education level of at least a bachelor’s degree, though usually candidates have advanced degrees. To be eligible to receive an H-1B visa, the employer of the candidates must first submit a Labor Condition Application (LCA) for certification.

Using statistical analysis, the general objective of this project is to determine what factors influence the probability that H-1B candidates obtain LCA certification by building binary classification models. In conducting the study, I utilize a trimmed dataset of H-1B LCAs filed from 2011 to 2017 with the Office of Foreign Labor Certification. Apart from indicating if an application received certification, this dataset also includes variables such as profession, state, and salary among other factors. Because the initial government sourced dataset is so large, I trims it down in order to achieve computational ability. I take precaution to maintain the integrity and improve the efficiency of the dataset by taking a random sample from within the data and scrubbing away outliers and missing values.

Initial Questions:

At the beginning of this project, I want to explore these three aspects of the dataset:

Determine the important factors or criteria that influence or determine the eligibility of an immigrant worker to apply for a work visa (H-1B):

Construct a logistic regression model in order to determine the important factors that determine eligibility
Test the validity of each model constructed through methods such as ANOVA analysis and error measuring, and select the best model.

Explore the change in demands and supplies of immigrant workers in various industries and regions:

Plot the overall trend of the number of LCA filed from 2011 to 2017. Pay attention to any sudden spike or dive in the data.
Find out the top 10 job types that are in demand, plot their overall trends of the number of LCA filed from 2013 to 2017 to see if their popularities are a sustainable fashion or a recent phenonomenon.
Create a heat map that shows up all the geographic locations of the employers in each year, 2013 to 2017, and compare all the maps to see if there is any significant change in the locations or any interesting pattern.

Explore the change in the certification rate of eligibility across states and over time:

Calculate the overall certification rates for each year (initial def. number of cases certified/ total number of cases filed).

Data Cleaning and Wrangling:

First, I will clean the dataset so it can be better used for future analysis and modeling purpose:

# Importing Data
# create an empty data frame
h1b = data.frame()
       
for (year in 14:11) {
    
    # read in files from 2011 to 2017 of h1b LCA records
    print(paste0("Reading ", year, " records .."))
    raw_data_path = paste0("h1b_raw_",year,".xlsx")
    new_df = read_xlsx(raw_data_path)

    # rename variables of dataframes for standardization
    print(paste0("Standardizing ", year, " records .."))
    if(year %in% c(14:11)) {
        
        new_df = new_df %>%
            mutate(CASE_NUMBER = LCA_CASE_NUMBER, 
                   CASE_STATUS = STATUS,
                   CASE_SUBMITTED = LCA_CASE_SUBMIT,
                   EMPLOYER_NAME = LCA_CASE_EMPLOYER_NAME,
                   EMPLOYMENT_START_DATE = LCA_CASE_EMPLOYMENT_START_DATE,
                   EMPLOYMENT_END_DATE = LCA_CASE_EMPLOYMENT_END_DATE,
                   EMPLOYER_POSTAL_CODE= LCA_CASE_EMPLOYER_POSTAL_CODE,
                   SOC_NAME = LCA_CASE_SOC_NAME,
                   SOC_CODE = LCA_CASE_SOC_CODE,
                   NAIC_CODE = LCA_CASE_NAICS_CODE,
                   JOB_TITLE = LCA_CASE_JOB_TITLE,
                   FULL_TIME_POSITION = FULL_TIME_POS, 
                   PREVAILING_WAGE = PW_1,
                   WORKSITE_CITY = LCA_CASE_WORKLOC1_CITY,
                   WORKSITE_STATE = LCA_CASE_WORKLOC1_STATE,
                   WILLFUL_VIOLATOR = NA)
    } else if (year == 15) {
        new_df = new_df %>% 
            mutate(TOTAL_WORKERS = `TOTAL WORKERS`,
                   WILLFUL_VIOLATOR = `WILLFUL VIOLATOR`)
    } else if (year == 17) {
        new_df = new_df %>% 
            mutate(NAIC_CODE = NAICS_CODE)
    }
    
    # add APL_YEAR & APL_MONTH columns to dataframes
    print(paste0("Adding new variables to ", year, " data frame..."))
    new_df = new_df %>%
        mutate(APL_YEAR = year(CASE_SUBMITTED),
               APL_MONTH = month(CASE_SUBMITTED))
    
    # Convert all times from character or POSITCX to Date of data earlier than 2015
    print(paste0("Transforming time variables in ", year, " data frame..."))
    if (year == 15) {
        new_df = new_df %>%
            mutate(CASE_SUBMITTED = 
                       as.Date(strptime(CASE_SUBMITTED, "%Y-%m-%d")),
                   DECISION_DATE = 
                       as.Date(strptime(DECISION_DATE, "%Y-%m-%d")),
                   EMPLOYMENT_START_DATE =
                       as.Date(strptime(EMPLOYMENT_START_DATE, "%m/%d/%Y")),
                   EMPLOYMENT_END_DATE = 
                       as.Date(strptime(EMPLOYMENT_END_DATE, "%m/%d/%Y")))
    } else {
         new_df = new_df %>%
            mutate(CASE_SUBMITTED = 
                       as.Date(strptime(CASE_SUBMITTED, "%Y-%m-%d")),
                   DECISION_DATE = 
                       as.Date(strptime(DECISION_DATE, "%Y-%m-%d")),
                   EMPLOYMENT_START_DATE =
                       as.Date(strptime(EMPLOYMENT_START_DATE, "%Y-%m-%d")),
                   EMPLOYMENT_END_DATE = 
                       as.Date(strptime(EMPLOYMENT_END_DATE, "%Y-%m-%d")))
    }
   
    # select only the relevant columns
    print(paste0("Select relevant variables from ", year, " data frame..."))
    new_df = new_df %>%
        select(CASE_NUMBER, 
               CASE_STATUS,
               CASE_SUBMITTED,
               APL_YEAR,
               APL_MONTH,
               DECISION_DATE,
               EMPLOYER_NAME,
               EMPLOYMENT_START_DATE,
               EMPLOYMENT_END_DATE,
               EMPLOYER_POSTAL_CODE,
               SOC_NAME,
               SOC_CODE,
               NAIC_CODE,
               JOB_TITLE,
               TOTAL_WORKERS,
               FULL_TIME_POSITION,
               PREVAILING_WAGE,
               WILLFUL_VIOLATOR,
               WORKSITE_CITY,
               WORKSITE_STATE)
    
    # merge data with already transformed data based on unique CASE_NUMBER
    print(paste0("Merging ", year, " data frame to h1b"))
    if (year == 17) {
        h1b = new_df
    } else {
        temp_non = anti_join(new_df,h1b,by = "CASE_NUMBER")
        h1b = rbind(h1b,temp_non)
    }
    
}

# Data Wrangling
# filter out records that are not from 2011 to 2017
h1b = h1b %>% 
    filter(APL_YEAR %in% c(2011:2017)) %>%
    mutate(APL_YEAR = factor(APL_YEAR), APL_MONTH = factor(APL_MONTH),
           SOC_NAME = factor(SOC_NAME), SOC_CODE = factor(SOC_CODE),
           NAIC_CODE = factor(NAIC_CODE), TOTAL_WORKERS = as.numeric(TOTAL_WORKERS),
           FULL_TIME_POSITION = factor(FULL_TIME_POSITION), PREVAILING_WAGE = as.numeric(PREVAILING_WAGE),
           WILLFUL_VIOLATOR = factor(WILLFUL_VIOLATOR), CASE_STATUS = factor(CASE_STATUS),
           WORKSITE_STATE = factor(WORKSITE_STATE))

# store employers that are WILLFUL_VIOLATOR into a separate data frame and remove the variable from h1b
wv = filter(h1b, WILLFUL_VIOLATOR == "Y")
h1b = select(h1b, -(WILLFUL_VIOLATOR))

# Assign Geo Code to each application based on EMPLOYER_POSTAL_CODE
# Many thanks to this blogpost by Kan Nishida & "zipcode" package designed by Jeffery Breen (https://blog.exploratory.io/geocoding-us-zip-code-data-with-dplyr-and-zipcode-package-7f539c3702b0)

# read in zipcode data & join two datasets

data(zipcode)
zipcode = select(zipcode,zip,latitude,longitude)
h1b = inner_join(h1b,zipcode,by=c("EMPLOYER_POSTAL_CODE"="zip"))

# remove entries where either "latitude" or "longitdue" is missing
h1b = filter(h1b, !is.na(latitude), !is.na(longitude))

# select only applications that are either certified or denied
h1b = h1b %>% 
    filter(CASE_STATUS == "CERTIFIED" | 
                     CASE_STATUS == "CERTIFIED-WITHDRAWN" | 
                     CASE_STATUS == "DENIED") %>% 
    mutate(CASE_STATUS = 
                         ifelse(CASE_STATUS=="DENIED","DENIED","CERTIFIED"))

# Create DEC_Y variable
h1b = mutate(h1b, DEC_YEAR = year(DECISION_DATE))

# Create employment length variable in terms of year
h1b = h1b %>% 
    mutate(EMP_LEN = as.numeric(
        (EMPLOYMENT_END_DATE - EMPLOYMENT_START_DATE)/365) %>%
    filter(EMP_LEN>0 & EMP_LEN <= 20)

# Fill out missing FULL_TIME_POSITION values
h1b = h1b %>% 
    mutate(FULL_TIME_POSITION = ifelse(is.na(FULL_TIME_POSITION),
                                       ifelse(PREVAILING_WAGE > 20000,2,1),
                                           FULL_TIME_POSITION))
h1b = h1b %>%
    filter(!(is.na(FULL_TIME_POSITION))) %>%
    mutate(FULL_TIME_POSITION = as.character(FULL_TIME_POSITION)) %>%
    mutate(FULL_TIME_POSITION = ifelse(FULL_TIME_POSITION=="1","N","Y"))

h1b$FULL_TIME_POSITION = factor(h1b$FULL_TIME_POSITION)

# Determine and reselect final variables 
h1b = h1b %>%
    select(CASE_STATUS,
           CASE_SUBMITTED,
           DECISION_DATE,
           APL_YEAR,
           DEC_YEAR,
           SOC_CODE,
           SOC_NAME,
           JOB_TITLE,
           NAIC_CODE,
           EMPLOYER_NAME,
           TOTAL_WORKERS,
           FULL_TIME_POSITION,
           PREVAILING_WAGE,
           EMP_LEN,
           WORKSITE_CITY,
           WORKSITE_STATE,
           latitude,
           longitude)

Exploratory Data Analysis:

Now let’s run some initial analysis to get a good sense of the dataset and also see if the initial proposals are feasible or not.

Exploratory Geo Plot

# get the map of United States
us_map = get_map(location='united states', zoom=4, maptype = "terrain",
             source='google',color='bw')

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=united+states&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=united%20states&sensor=false

Plot the geographic locations of the applications filed in 2011, 2014 & 2017

These are the plots of the GEO Locations of LCA applications filed respectively in 2011, 2014 and 2017. From these two plots, we can tell that most of the immigrant workers are located in both the east and west coasts as well as the midwestern regions of U.S, which aligns with our previous expectation, as large cities tend to be more concentrated in this area. However, these three plots are performing terribly as they rarely show any significant change in the spread of immigrant workers.

Let’s now zoom in on the locations and look at data in the midwestern area.

# create a list of midwestern states
midwest_list = c("IL",
                 "IN",
                 'IA',
                 'KS',
                 'MI',
                 'MN',
                 'MO',
                 'NE',
                 'ND',
                 'OH',
                 'SD',
                 'WI')
midwest_data = h1b %>% 
        filter(WORKSITE_STATE %in% midwest_list)

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

These three plots show the GEO Locations of LCA applications filed respectively in 2011, 2014 and 2017 in the midwestern area. If we look at them separately, we can see that most of the applicants are concentrated in large states like Wisconsin, Illinois, Indiana and Michigan. More significantly, large cities like Dallas and Chicago have higher concentration of applicants compared to other areas.

Now let’s try plotting the applications in these three years altogether.

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

We can tell from the plot that, although applicants are still concentrated in large cities like Chicago and Dallas throughout the years, they become more spread out in 2014 and 2017 than in 2011, given that 2014 and 2017 data have less large dots but more small dots scattering all over the map. It makes us wonder if this suggests that the geo-location of the job is becoming less important as a factor that determines the probability of certification over the years.

Exploratory Trend Plot

Now, let’s calculate the respective certification rates of each state in year of 2011, 2014 and 2017:

cert_rate_by_state = h1b %>%
                group_by(APL_YEAR,WORKSITE_STATE) %>%
                summarise(cert_no = sum(CASE_STATUS=="CERTIFIED"),
                            total_no = n(),cert_rate = cert_no/(total_no)) %>%
                filter(total_no >= 1000,APL_YEAR %in% c('2011','2014','2017')) %>%
                arrange(desc(cert_rate))
cert_rate_by_state

## # A tibble: 114 x 5
## # Groups:   APL_YEAR [3]
##    APL_YEAR WORKSITE_STATE cert_no total_no cert_rate
##      <fctr>         <fctr>   <int>    <int>     <dbl>
##  1     2017             DE    2523     2545 0.9913556
##  2     2017             RI    1856     1873 0.9909237
##  3     2017             AZ    7123     7194 0.9901307
##  4     2017             KY    2040     2061 0.9898108
##  5     2017             MO    6389     6457 0.9894688
##  6     2017             NH    1117     1129 0.9893711
##  7     2017             TN    5288     5346 0.9891508
##  8     2017             NJ   32329    32684 0.9891384
##  9     2017             MN    7918     8005 0.9891318
## 10     2017             WA   19031    19244 0.9889316
## # ... with 104 more rows

We can see from this result that, although the certifcation rates do not vary a lot from year to year, the top 10 certification rates are all in the year of 2017. Now, let’s plot the dataset to get a better visualization:

# 2011
cert_rate_by_state %>% filter(APL_YEAR=="2011") %>%
    ggplot(aes(x=WORKSITE_STATE,y=cert_rate)) +
    geom_point(color="red")

# 2014
cert_rate_by_state %>% filter(APL_YEAR=="2014") %>%
    ggplot(aes(x=WORKSITE_STATE,y=cert_rate)) +
    geom_point(color="blue")

# 2017
cert_rate_by_state %>% filter(APL_YEAR=="2017") %>%
    ggplot(aes(x=WORKSITE_STATE,y=cert_rate)) +
    geom_point(color="green")

According to the plots, the certification rates in 2011 are generally low. However, the certifcation rates gradually improve from 2011 to 2014 and 2014 to 2017, which suggests that the year of application might also be an important factor in determining the probability of an application being certified. Moreover, there are a lot of variations from state to state in terms of the certification rate in both 2011 and 2014; however, the variations decrease greatly in 2017, suggesting that the effect of geo-locations of the worksite on the certification rate is deteriorating.

Now let’s look at the APL_YEAR (application year) variable:

Figure 1 shows the number of applications filed as well as the total number of applicants in each year from 2011 to 2017. From the plot, we can easily tell that the total number of applicants is decreasing over the years while the number of applications is increasing over the years (the slight decrease in 2017 might be due to the fact that data is only collected until October).

Now let’s explore the relationship between TOTAL_WORKER and the certification rate:

## # A tibble: 7 x 5
##   APL_YEAR cert_no total_no cert_rate averge_total_worker
##     <fctr>   <int>    <int>     <dbl>               <dbl>
## 1     2017  475573   482498 0.9856476            1.926178
## 2     2016  581162   590293 0.9845314            1.954894
## 3     2015  558468   568615 0.9821549            2.153891
## 4     2014  469142   480810 0.9757326            2.073037
## 5     2013  395682   407721 0.9704725            2.172439
## 6     2012  284136   298392 0.9522239            4.897205
## 7     2011  264508   288635 0.9164100            5.956568

From the table and plot above, we can clearly see that the average TOTAL_WORKERS per application is decreasing, while the certification rate is increasing. Yet when we calculate the certification rate associated with TOTAL_WORKERS, the result indicates differently from above:

h1b %>%
    group_by(TOTAL_WORKERS) %>%
    summarise(cert_no = sum(CASE_STATUS=="CERTIFIED"),
                            total_no = n(),cert_rate = cert_no/(total_no)) %>%
                filter(total_no >= 1000) %>%
                arrange(desc(cert_rate))

## # A tibble: 18 x 4
##    TOTAL_WORKERS cert_no total_no cert_rate
##            <dbl>   <int>    <int>     <dbl>
##  1            20   18626    18835 0.9889036
##  2           100    1898     1924 0.9864865
##  3            15   36060    36567 0.9861350
##  4             9    1379     1399 0.9857041
##  5             5   44424    45144 0.9840510
##  6             6    9656     9814 0.9839006
##  7            30   16065    16338 0.9832905
##  8            10   38595    39288 0.9823610
##  9             8    2379     2422 0.9822461
## 10            40    1238     1262 0.9809826
## 11            12    1522     1556 0.9781491
## 12             7    2193     2246 0.9764025
## 13            25    9649     9888 0.9758293
## 14            50    6161     6320 0.9748418
## 15             1 2749310  2831398 0.9710080
## 16             3   26993    27809 0.9706570
## 17             4    9507     9799 0.9702010
## 18             2   46768    48531 0.9636727

The result shows that the correlation between TOTAL_WORKERS and the certification rate is not as significant as we expected, and thus, TOTAL_WORKERS might not be a statistically significant factor when predicting the certification rate.

Now let’s take a look at the FULL_TIME_POSITION variable:

h1b %>% 
    group_by(FULL_TIME_POSITION) %>%
    summarise(count=n(), per = 100*count/nrow(h1b))

## # A tibble: 3 x 3
##   FULL_TIME_POSITION   count       per
##               <fctr>   <int>     <dbl>
## 1                  N   74114  2.377762
## 2                  Y 2443232 78.384993
## 3               <NA>  599618 19.237245

Surprisingly almost 20% of the records have missing values regarding the Full Time Position. After further examination, I figure out that all records from the 2016 file have missing values. However, I think it is an important factor in determing the certification rate so I decide to fill in these values based on the relationship between position status and wage.

h1b %>% 
    group_by(FULL_TIME_POSITION) %>%
    summarise("avg wage" = mean(PREVAILING_WAGE,na.rm=T),
              '75%' = quantile(PREVAILING_WAGE,probs = 0.75,na.rm=TRUE),
              "25%" = quantile(PREVAILING_WAGE,probs = 0.25,na.rm=TRUE),
              "min" = min(PREVAILING_WAGE,na.rm=T),
              "max" = max(PREVAILING_WAGE,na.rm=T))

## # A tibble: 3 x 6
##   FULL_TIME_POSITION `avg wage`   `75%`    `25%`   min       max
##               <fctr>      <dbl>   <dbl>    <dbl> <dbl>     <dbl>
## 1                  N    102.665    33.5    21.46     0    163717
## 2                  Y  69980.215 82139.0 53518.00     0 820132347
## 3               <NA>  69812.137 83886.0 55661.00     0   4316000

Based on the statistics of the prevailing wage I then select 20000 as the cut off value for PART_TIME position:

h1b = h1b %>% 
    mutate(FULL_TIME_POSITION = ifelse(is.na(FULL_TIME_POSITION),
                                       ifelse(PREVAILING_WAGE > 20000,2,1),
                                           FULL_TIME_POSITION))
h1b = h1b %>%
    filter(!(is.na(FULL_TIME_POSITION))) %>%
    mutate(FULL_TIME_POSITION = as.character(FULL_TIME_POSITION)) %>%
    mutate(FULL_TIME_POSITION = ifelse(FULL_TIME_POSITION=="1","N","Y"))

h1b$FULL_TIME_POSITION = factor(h1b$FULL_TIME_POSITION)

h1b %>% 
    group_by(FULL_TIME_POSITION) %>%
    summarise("avg wage" = mean(PREVAILING_WAGE,na.rm=T),
              '75%' = quantile(PREVAILING_WAGE,probs = 0.75,na.rm=TRUE),
              "25%" = quantile(PREVAILING_WAGE,probs = 0.25,na.rm=TRUE),
              "min" = min(PREVAILING_WAGE,na.rm=T),
              "max" = max(PREVAILING_WAGE,na.rm=T))

## # A tibble: 2 x 6
##   FULL_TIME_POSITION `avg wage`    `75%`    `25%`   min       max
##               <fctr>      <dbl>    <dbl>    <dbl> <dbl>     <dbl>
## 1                  N   127.0514    35.16    22.08     0    163717
## 2                  Y 70795.9859 82763.00 54475.00     0 820132347

Regenerating the stat summary shows that the classification is reasonable, given that the figures did not change much.

Now let’s officially explore the relationship between the FULL_TIME_POSITION and the certification rate:

h1b %>%
    group_by(FULL_TIME_POSITION) %>%
    summarise(cert_no = sum(CASE_STATUS=="CERTIFIED"),
                            total_no = n(),cert_rate = cert_no/(total_no)) %>%
                filter(total_no >= 1000) %>%
                arrange(desc(cert_rate))

## # A tibble: 2 x 4
##   FULL_TIME_POSITION cert_no total_no cert_rate
##               <fctr>   <int>    <int>     <dbl>
## 1                  Y 2926382  3006273 0.9734252
## 2                  N  102289   110691 0.9240950

As expected, the certification rate of full-time position is much higher than that of part-time position.

The same conclusion can be reached if we look at the certification rate of full-time and part-time workers in each year from 2011 to 2017. However, the gap is also shrinking over the years from almost a 15% difference to only 3%, which suggests that the correlation between FULL_TIME_POSITION and the certification rate is deteriorating over the years.

Let’s now explore the EMP_LEN variable to see if there is a correlation between the length of employment and the certification rate:

# helper function to round up EMP_LEN
round2 = function(x, n) {
  posneg = sign(x)
  z = abs(x)*10^n
  z = z + 0.5
  z = trunc(z)
  z = z/10^n
  z*posneg
}

h1b %>%
    mutate(EMP_LEN_r = round2(as.numeric(EMP_LEN),0)) %>%
    group_by(EMP_LEN_r) %>%
    summarise(cert_no = sum(CASE_STATUS=="CERTIFIED"),
              total_no = n(),
              cert_rate = cert_no/(total_no)) %>%
    filter(total_no >= 1000) %>%
    arrange(desc(cert_rate))

## # A tibble: 4 x 4
##   EMP_LEN_r cert_no total_no cert_rate
##       <dbl>   <int>    <int>     <dbl>
## 1         3 2732655  2803915 0.9745855
## 2         1  108256   113581 0.9531172
## 3         0   19577    20574 0.9515408
## 4         2  168182   178880 0.9401945

The result can vaguely show the correlation between EMP_LEN and certification rate, as the certification rate of 2 years is even lower than that of 1 year and less than 1 year.

The only thing we can tell for sure from the graph is that applications with a proposed employment period of 3 years will have a higher probability of being certified. To see a more specific correlation, we will have to construct the logistic model.

Let’s now explore the PREVAILING_WAGE factor:

The graph basically align with our previous expectation that the certification rate increases as the proposed wage rate increases; however, further analysis is needed to determine a more precise correlation between these two variables.

Finally, let’s look at the relationship between the type of job and the certification rate:

Obviously, most of the top-10 positions are related to computer science and software development, which are trending jobs in recent years. However, some of the position trends end after 2014, which suggest that these positions probably are no longer in high demand.

Comparing this graph to the previous, we can see that there are three new jobs that are in top demand: Accountants, Financial Analysts and Management Analysts. Yet most of the jobs that are in high demand are still related to computer science and software develoment.

Final Question and Approach:

After the initial exploratory data analysis, I decide that I will construct an effective model to predict the certification rate of an H-1B LCA application based on selected variables. My approach will be mainly four-fold: 1. Construct two different models: i. Binary logistic regression model (statistical) ii. Decision tree model (machine learning) 2. Test the validity of each model using: i. Confusion Matrix ii. ROC Curve & AUC statistics 3. Compare the models and determine the better model for prediction. 4. Based on the selected model, determine the characteristic of a successful application.

Data Analysis:

data preparation

Because this dataset is too large to run efficiently, I decide to further trim down the dataset:

# remove NA values
h1b = h1b[complete.cases(h1b),]

# Take a smaller subset of the data
data_size = floor(0.25*nrow(h1b))
set.seed(1000)
data_id = sample(seq_len(nrow(h1b)), size = data_size)
h1b_s = h1b[data_id,]

# converting the dependent variable to binary
h1b_s$CASE_STATUS<-ifelse(h1b_s$CASE_STATUS=="CERTIFIED",1,0)

# select only major occupation
h1b_s$OCCUPATION<-NA
h1b_s$OCCUPATION[grep("engineer",h1b_s$SOC_NAME, ignore.case = T)]<-"ENGINEER"
h1b_s$OCCUPATION[grep("manager",h1b_s$SOC_NAME, ignore.case = T)]<-"MANAGER"
h1b_s$OCCUPATION[grep("technician",h1b_s$SOC_NAME, ignore.case = T)]<-"TECHNICIAN"
h1b_s$OCCUPATION[grep("teacher",h1b_s$SOC_NAME, ignore.case = T)]<-"TEACHER"
h1b_s$OCCUPATION[grep("executive",h1b_s$SOC_NAME, ignore.case = T)]<-"EXECUTIVE"
h1b_s$OCCUPATION[grep("accountant",h1b_s$SOC_NAME, ignore.case = T)]<-"ACCOUNTANT"
h1b_s$OCCUPATION[grep("analyst",h1b_s$SOC_NAME, ignore.case = T)]<-"ANALYST"
h1b_s$OCCUPATION[grep("develop",h1b_s$SOC_NAME, ignore.case = T)]<-"DEVELOPER"
h1b_s$OCCUPATION[grep("computer",h1b_s$SOC_NAME, ignore.case = T)]<-"COMPUTER OCCUPATIONS"
h1b_s = h1b_s %>% 
    mutate(OCCUPATION = ifelse(is.na(OCCUPATION),"OTHER",OCCUPATION))
h1b_s$OCCUPATION = factor(h1b_s$OCCUPATION)

# remove WORKSITE_STATE with too few variables
h1b_s = h1b_s %>%
    group_by(WORKSITE_STATE) %>%
    filter(n()>=1000)

# remove NA values
h1b_s = h1b_s[complete.cases(h1b_s),]

h1b_s %>%
    group_by(APL_YEAR) %>%
    summarise(n())

The dataset is now much easier to work with and the numbers of applications are approximately equal across the years. Let’s now dive into modeling section.

Logistic Regression

Before fitting the model, let’s first split the dataset into training and testing set:

# create training and testing dataset
smp_size = floor(0.7*nrow(h1b_s))
set.seed(1996)
train_id = sample(seq_len(nrow(h1b_s)), size = smp_size)
h1b_glm.train = h1b_s[train_id,]
h1b_glm.test = h1b_s[-train_id,]

Modeling

Now fit the training data to a logistic regression model:

# create formula
f1 = as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + 
        TOTAL_WORKERS + FULL_TIME_POSITION + PREVAILING_WAGE + EMP_LEN + WORKSITE_STATE + longitude + latitude
# logistic regression 
h1b_glm_m <-glm(f1, family=binomial(link = logit), data = h1b_glm.train)
summary(h1b_glm_m)

## 
## Call:
## glm(formula = f1, family = binomial(link = logit), data = h1b_glm.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2804   0.1440   0.1818   0.2551   1.1363  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     7.431e-02  1.648e-01   0.451 0.652055    
## APL_YEAR2012                    5.758e-01  2.685e-02  21.447  < 2e-16 ***
## APL_YEAR2013                    9.361e-01  2.797e-02  33.469  < 2e-16 ***
## APL_YEAR2014                    1.105e+00  2.821e-02  39.155  < 2e-16 ***
## APL_YEAR2015                    1.351e+00  2.924e-02  46.188  < 2e-16 ***
## APL_YEAR2016                    1.535e+00  3.051e-02  50.320  < 2e-16 ***
## APL_YEAR2017                    1.643e+00  3.414e-02  48.131  < 2e-16 ***
## OCCUPATIONANALYST               4.560e-01  5.195e-02   8.779  < 2e-16 ***
## OCCUPATIONCOMPUTER OCCUPATIONS  1.220e+00  4.719e-02  25.862  < 2e-16 ***
## OCCUPATIONDEVELOPER             9.566e-01  5.027e-02  19.031  < 2e-16 ***
## OCCUPATIONENGINEER              4.370e-01  5.341e-02   8.183 2.77e-16 ***
## OCCUPATIONEXECUTIVE            -8.809e-01  1.136e-01  -7.751 9.09e-15 ***
## OCCUPATIONMANAGER              -1.912e-01  5.317e-02  -3.596 0.000323 ***
## OCCUPATIONOTHER                 1.037e-01  4.641e-02   2.235 0.025393 *  
## OCCUPATIONTEACHER               2.887e-01  5.889e-02   4.902 9.47e-07 ***
## OCCUPATIONTECHNICIAN           -7.681e-02  9.029e-02  -0.851 0.394942    
## TOTAL_WORKERS                   2.971e-03  1.723e-03   1.724 0.084682 .  
## FULL_TIME_POSITIONY             5.991e-01  3.045e-02  19.671  < 2e-16 ***
## PREVAILING_WAGE                -1.098e-07  5.759e-08  -1.907 0.056511 .  
## EMP_LEN                         2.656e-01  1.331e-02  19.950  < 2e-16 ***
## WORKSITE_STATEAR                3.794e-01  1.707e-01   2.222 0.026252 *  
## WORKSITE_STATEAZ                4.705e-01  1.307e-01   3.601 0.000318 ***
## WORKSITE_STATECA                2.694e-01  1.083e-01   2.489 0.012812 *  
## WORKSITE_STATECO                2.043e-02  1.278e-01   0.160 0.872981    
## WORKSITE_STATECT                7.191e-01  1.434e-01   5.016 5.29e-07 ***
## WORKSITE_STATEDC               -1.332e-01  1.281e-01  -1.040 0.298561    
## WORKSITE_STATEDE                5.201e-01  1.695e-01   3.069 0.002150 ** 
## WORKSITE_STATEFL                1.063e-02  1.107e-01   0.096 0.923451    
## WORKSITE_STATEGA                2.485e-01  1.146e-01   2.169 0.030092 *  
## WORKSITE_STATEIA                7.713e-01  1.709e-01   4.514 6.37e-06 ***
## WORKSITE_STATEID               -9.570e-02  2.051e-01  -0.467 0.640819    
## WORKSITE_STATEIL                4.136e-01  1.129e-01   3.663 0.000250 ***
## WORKSITE_STATEIN                3.466e-01  1.336e-01   2.595 0.009469 ** 
## WORKSITE_STATEKS                2.719e-01  1.598e-01   1.702 0.088723 .  
## WORKSITE_STATEKY                2.605e-01  1.585e-01   1.644 0.100142    
## WORKSITE_STATELA               -5.186e-02  1.450e-01  -0.358 0.720614    
## WORKSITE_STATEMA                3.609e-01  1.181e-01   3.057 0.002237 ** 
## WORKSITE_STATEMD                2.322e-01  1.184e-01   1.961 0.049828 *  
## WORKSITE_STATEMI                3.815e-01  1.192e-01   3.200 0.001374 ** 
## WORKSITE_STATEMN                1.586e-01  1.254e-01   1.265 0.205966    
## WORKSITE_STATEMO                5.090e-01  1.369e-01   3.717 0.000201 ***
## WORKSITE_STATEMS               -1.760e-02  1.889e-01  -0.093 0.925759    
## WORKSITE_STATENC                3.248e-01  1.187e-01   2.737 0.006195 ** 
## WORKSITE_STATENE               -1.167e-02  1.684e-01  -0.069 0.944776    
## WORKSITE_STATENH                5.739e-01  2.449e-01   2.344 0.019094 *  
## WORKSITE_STATENJ                4.558e-01  1.144e-01   3.986 6.71e-05 ***
## WORKSITE_STATENM                2.211e-01  1.906e-01   1.160 0.246160    
## WORKSITE_STATENV               -4.103e-01  1.539e-01  -2.665 0.007697 ** 
## WORKSITE_STATENY                9.470e-02  1.083e-01   0.875 0.381753    
## WORKSITE_STATEOH                3.540e-01  1.196e-01   2.960 0.003075 ** 
## WORKSITE_STATEOK                2.680e-01  1.617e-01   1.657 0.097573 .  
## WORKSITE_STATEOR                3.379e-01  1.433e-01   2.359 0.018322 *  
## WORKSITE_STATEPA                4.731e-01  1.163e-01   4.068 4.75e-05 ***
## WORKSITE_STATERI                6.196e-01  2.183e-01   2.838 0.004534 ** 
## WORKSITE_STATESC               -3.180e-02  1.479e-01  -0.215 0.829731    
## WORKSITE_STATETN                1.230e-01  1.303e-01   0.944 0.345151    
## WORKSITE_STATETX                3.362e-01  1.081e-01   3.111 0.001864 ** 
## WORKSITE_STATEUT               -3.131e-02  1.476e-01  -0.212 0.832007    
## WORKSITE_STATEVA                2.623e-01  1.162e-01   2.258 0.023967 *  
## WORKSITE_STATEWA                5.366e-01  1.222e-01   4.391 1.13e-05 ***
## WORKSITE_STATEWI                5.185e-01  1.414e-01   3.667 0.000246 ***
## longitude                       3.408e-03  7.158e-04   4.761 1.93e-06 ***
## latitude                        1.405e-02  2.505e-03   5.608 2.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 139318  on 539147  degrees of freedom
## Residual deviance: 126964  on 539085  degrees of freedom
## AIC: 127090
## 
## Number of Fisher Scoring iterations: 7

From the output above, we can see that the residual variance is much smaller than the null deviance, which suggests that the predictive power of this model is much better than that of a null model that only includes the intercept. While in terms of the predictors, most of them are statistical significant except TOTAL_WORKERS (the number of foreign workers per application), which aligns with my previous suspicion that its correlation to the certification probability is not significant in EDA part. I will try rebuilding the model without TOTAL_WORKERS and see if it affects the model’s predictive power significantly:

h1b_glm_m2 = update(h1b_glm_m,~.-(TOTAL_WORKERS))
anova(h1b_glm_m,h1b_glm_m2,test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + TOTAL_WORKERS + 
##     FULL_TIME_POSITION + PREVAILING_WAGE + EMP_LEN + WORKSITE_STATE + 
##     longitude + latitude
## Model 2: as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + FULL_TIME_POSITION + 
##     PREVAILING_WAGE + EMP_LEN + WORKSITE_STATE + longitude + 
##     latitude
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1    539085     126964                       
## 2    539086     126967 -1  -3.2143    0.073 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual deviance only increases by 3 and the p-value is larger than 0.05, both indicating that the difference in model fit is not statistically significant, and thus, I will accept the new model.

Evaluating the performance of the model fitted

# generate predictions
pred_glm = predict(h1b_glm_m2,newdata = h1b_glm.test,type="response")

t = h1b_s %>% 
        group_by(APL_YEAR) %>%
        summarise(cert_no = sum(CASE_STATUS==1),
                  total_no = n(),
                  cert_rate = cert_no/total_no)
mean(t$cert_rate)

## [1] 0.9668627

Based on the average certification rate of the dataset, I choose “0.97” to be our certification cutoff value:

# generate 0 or 1 based on the predicted probability
pred_glm_c = ifelse(pred_glm>=0.97,1,0)

confusion matrix

# generate confusion matrix
glm_cf = confusionMatrix(pred_glm_c,h1b_glm.test$CASE_STATUS, positive = "1")
glm_cf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0   4133  63284
##          1   2250 161397
##                                           
##                Accuracy : 0.7164          
##                  95% CI : (0.7145, 0.7182)
##     No Information Rate : 0.9724          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0648          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.71834         
##             Specificity : 0.64750         
##          Pos Pred Value : 0.98625         
##          Neg Pred Value : 0.06131         
##              Prevalence : 0.97238         
##          Detection Rate : 0.69849         
##    Detection Prevalence : 0.70823         
##       Balanced Accuracy : 0.68292         
##                                           
##        'Positive' Class : 1               
##

According to the confusion matrix output table, the accuracy of the logistic regression model is only 72% with an error rate of 28%. It might appear okay at first; however, the high no information rate (97.24%) suggests that an accuracy of 97.24% can be achieved if we just randomly assign “CERTIFIED” or “DENIED” value to a case. Unfortunately, it might suggest that our model is completely useless in predicting the certification probability of a LCA application. Even so, the model achieves a relatively fair performance in correctly identifying 65% of the denied cases according to the specitivity rate (the proportion of negatives that are correctly identified), although the sensitivity rate (the proportion of positivies that are correctly identified) is relatively low in this case.

Receiving Operator Characteristic (ROC) Curves

Let’s also plot the ROC curves to further assess the performance of the logistic regression model:

# ROC curve
pred_glm_r = prediction(pred_glm,h1b_glm.test$CASE_STATUS)
glm_roc = performance(pred_glm_r,"tpr","fpr")
plot(glm_roc)

# area under the curve
glm.auc = as.numeric(performance(pred_glm_r,"auc")@y.values)
glm.auc

## [1] 0.7388512

The ROC curve plot the true positive rate (the proportion of true postive cases identified as such) against the negative positive rate (the proportion of negative cases incorrectly identified as positive). The area under the curve measures the accuracy of the model, which is 74.3% in this case. After all, the ROC curve and the area under the curve suggest that the model is relatively fair rather than being completely worthless, which would generate a 45 degree curve with auc equals to 50%.

Next I decide to try fitting a decision tree model to the data to see if the performance will be improved.

Decision Tree

Before fitting the model, let’s first split the dataset into training and testing set:

# create training and testing dataset
smp_size = floor(0.7*nrow(h1b_s))
set.seed(1996)
train_id = sample(seq_len(nrow(h1b_s)), size = smp_size)
h1b_dt.train = h1b_s[train_id,]
h1b_dt.test = h1b_s[-train_id,]

Modeling:

# create formula:
f2 = as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + 
            TOTAL_WORKERS + FULL_TIME_POSITION + PREVAILING_WAGE + as.numeric(EMP_LEN) + WORKSITE_STATE + longitude + latitude

# model
h1b_dt_m = ctree(f2, data=h1b_dt.train)

Testing the performance of the model

Now let’s run the decision tree model on our testing dataset

# generate predictions
pred_dt = predict(h1b_dt_m, h1b_dt.test)

# generate confusion matrix
dt_cf = confusionMatrix(pred_dt,h1b_dt.test$CASE_STATUS,positive = "1")
dt_cf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0      9     20
##          1   6374 224661
##                                          
##                Accuracy : 0.9723         
##                  95% CI : (0.9717, 0.973)
##     No Information Rate : 0.9724         
##     P-Value [Acc > NIR] : 0.5588         
##                                          
##                   Kappa : 0.0026         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.99991        
##             Specificity : 0.00141        
##          Pos Pred Value : 0.97241        
##          Neg Pred Value : 0.31034        
##              Prevalence : 0.97238        
##          Detection Rate : 0.97229        
##    Detection Prevalence : 0.99987        
##       Balanced Accuracy : 0.50066        
##                                          
##        'Positive' Class : 1              
##

Apparently, the decision tree model greatly improves the accuary from 70% to 97.2% in this case, however, its performance is the same as that of the no information classification. Moreover, the model achieves a high sensitivity rate of nearly 100%, but achieves a very low specificity rate of almost 0%. Given that the original data has a high percentage of positive cases, I will concentrate more on how effectively can the model identify the negative cases, the denied applications. Therefore, I cannot conclude that the decision tree model performs much better than the logistic regression one even with a high accuracy rate.

Receiving Operator Characteristic (ROC) Curves

Let’s also plot the ROC curves to further assess the performance of the decision tree model:

# generate probability
dt_prob = 1 - unlist(treeresponse(h1b_dt_m,h1b_dt.test), 
                     use.names=F)[seq(1,nrow(h1b_dt.test)*2,2)]
# ROC curve
pred_dt_r = prediction(dt_prob,h1b_dt.test$CASE_STATUS)
dt_roc = performance(pred_dt_r,"tpr","fpr")
plot(dt_roc)

# area under the curve
dt.auc = as.numeric(performance(pred_dt_r,"auc")@y.values)
dt.auc

## [1] 0.7451939

In this case, the ROC curve and the area under the curve also suggest that the model is relatively fair rather than being completely worthless. It also performs slightly better than the logistic regression model with auc increases from 74.3% to 74.9%.

Imbalanced Sample Data Correction

However, I think that neither of the models possess a significant predictive power and conclude that it is probably due to the fact that our dataset is highly skewed with a majority of the cases are certified and only a few of them are denied after further researching. With imbalanced dataset, the model does not receive necessary information about the minority class to make an accurate prediciton. Therefore, I decide to further transform h1b_s dataset to a balanced one using Synthetic Data Generation.

# check number of cases
table(h1b_s$CASE_STATUS)

Apparently the number of certified cases is almost 35 times more than that of denied cases.

# balance dataset
h1b_sb = h1b_s %>%
    select(CASE_STATUS,APL_YEAR,OCCUPATION, 
        TOTAL_WORKERS,FULL_TIME_POSITION,PREVAILING_WAGE,EMP_LEN,
            WORKSITE_STATE,longitude,latitude) %>%
    mutate(EMP_LEN = as.numeric(EMP_LEN))

h1b_b = ROSE(CASE_STATUS ~ ., data=h1b_sb, seed = 1)$data

# check the result
table(h1b_b$CASE_STATUS)
prop.table(table(h1b_b$CASE_STATUS))

As shown above, I now have a much more balanced dataset with the proportion of certified and denied cases to be 50-50.

Now I rebuild both of the logistic regression and conditonal tree model to see if there is any improvement.

Logistic Regression (NEW)

# create training and testing dataset
smp_size = floor(0.7*nrow(h1b_b))
set.seed(1996)
train_id = sample(seq_len(nrow(h1b_b)), size = smp_size)
h1b_glmb.train = h1b_b[train_id,]
h1b_glmb.test = h1b_b[-train_id,]

Modeling

Now fit the new training data to a logistic regression model

# create formula
f1 = as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + 
        TOTAL_WORKERS + FULL_TIME_POSITION + PREVAILING_WAGE + EMP_LEN + WORKSITE_STATE + longitude + latitude
# logistic regression 
h1b_glm_mn <-glm(f1, family=binomial(link = logit), data = h1b_glmb.train)
summary(h1b_glm_mn)

## 
## Call:
## glm(formula = f1, family = binomial(link = logit), data = h1b_glmb.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0062  -1.0004  -0.3285   0.9599   2.6393  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -3.355e+00  6.421e-02 -52.248  < 2e-16 ***
## APL_YEAR2012                    5.818e-01  1.143e-02  50.910  < 2e-16 ***
## APL_YEAR2013                    9.448e-01  1.115e-02  84.700  < 2e-16 ***
## APL_YEAR2014                    1.102e+00  1.096e-02 100.488  < 2e-16 ***
## APL_YEAR2015                    1.375e+00  1.095e-02 125.524  < 2e-16 ***
## APL_YEAR2016                    1.563e+00  1.114e-02 140.355  < 2e-16 ***
## APL_YEAR2017                    1.638e+00  1.188e-02 137.854  < 2e-16 ***
## OCCUPATIONANALYST               4.716e-01  2.087e-02  22.595  < 2e-16 ***
## OCCUPATIONCOMPUTER OCCUPATIONS  1.175e+00  1.893e-02  62.097  < 2e-16 ***
## OCCUPATIONDEVELOPER             9.193e-01  1.968e-02  46.711  < 2e-16 ***
## OCCUPATIONENGINEER              3.871e-01  2.140e-02  18.093  < 2e-16 ***
## OCCUPATIONEXECUTIVE            -9.208e-01  6.346e-02 -14.511  < 2e-16 ***
## OCCUPATIONMANAGER              -2.068e-01  2.284e-02  -9.052  < 2e-16 ***
## OCCUPATIONOTHER                 3.361e-02  1.927e-02   1.744 0.081174 .  
## OCCUPATIONTEACHER               1.918e-01  2.422e-02   7.916 2.45e-15 ***
## OCCUPATIONTECHNICIAN           -2.005e-01  3.955e-02  -5.069 4.00e-07 ***
## TOTAL_WORKERS                  -1.369e-03  2.447e-04  -5.595 2.20e-08 ***
## FULL_TIME_POSITIONY             6.467e-01  1.368e-02  47.259  < 2e-16 ***
## PREVAILING_WAGE                -3.565e-08  1.893e-08  -1.883 0.059668 .  
## EMP_LEN                         3.005e-01  5.542e-03  54.219  < 2e-16 ***
## WORKSITE_STATEAR                6.007e-01  6.477e-02   9.275  < 2e-16 ***
## WORKSITE_STATEAZ                5.166e-01  5.157e-02  10.017  < 2e-16 ***
## WORKSITE_STATECA                3.071e-01  4.524e-02   6.789 1.13e-11 ***
## WORKSITE_STATECO                1.810e-01  5.264e-02   3.439 0.000584 ***
## WORKSITE_STATECT                6.376e-01  5.301e-02  12.028  < 2e-16 ***
## WORKSITE_STATEDC               -1.194e-01  5.370e-02  -2.223 0.026188 *  
## WORKSITE_STATEDE                5.315e-01  6.081e-02   8.741  < 2e-16 ***
## WORKSITE_STATEFL                1.585e-02  4.658e-02   0.340 0.733680    
## WORKSITE_STATEGA                2.431e-01  4.709e-02   5.162 2.44e-07 ***
## WORKSITE_STATEIA                5.318e-01  6.188e-02   8.594  < 2e-16 ***
## WORKSITE_STATEID                1.086e-01  8.406e-02   1.292 0.196510    
## WORKSITE_STATEIL                4.201e-01  4.652e-02   9.030  < 2e-16 ***
## WORKSITE_STATEIN                4.570e-01  5.328e-02   8.577  < 2e-16 ***
## WORKSITE_STATEKS                2.291e-01  6.076e-02   3.771 0.000163 ***
## WORKSITE_STATEKY                4.121e-01  6.277e-02   6.565 5.19e-11 ***
## WORKSITE_STATELA               -7.602e-03  6.118e-02  -0.124 0.901110    
## WORKSITE_STATEMA                3.315e-01  4.760e-02   6.965 3.28e-12 ***
## WORKSITE_STATEMD                2.158e-01  4.898e-02   4.405 1.06e-05 ***
## WORKSITE_STATEMI                4.018e-01  4.830e-02   8.318  < 2e-16 ***
## WORKSITE_STATEMN                2.324e-01  5.052e-02   4.600 4.22e-06 ***
## WORKSITE_STATEMO                5.604e-01  5.326e-02  10.521  < 2e-16 ***
## WORKSITE_STATEMS                4.884e-02  8.244e-02   0.592 0.553623    
## WORKSITE_STATENC                4.035e-01  4.816e-02   8.378  < 2e-16 ***
## WORKSITE_STATENE               -2.440e-02  6.754e-02  -0.361 0.717945    
## WORKSITE_STATENH                4.930e-01  7.928e-02   6.218 5.03e-10 ***
## WORKSITE_STATENJ                4.912e-01  4.650e-02  10.563  < 2e-16 ***
## WORKSITE_STATENM                1.995e-01  7.501e-02   2.659 0.007835 ** 
## WORKSITE_STATENV               -3.484e-01  6.635e-02  -5.250 1.52e-07 ***
## WORKSITE_STATENY                1.153e-01  4.544e-02   2.539 0.011132 *  
## WORKSITE_STATEOH                3.195e-01  4.839e-02   6.603 4.03e-11 ***
## WORKSITE_STATEOK                2.761e-01  6.421e-02   4.299 1.71e-05 ***
## WORKSITE_STATEOR                4.543e-01  5.619e-02   8.085 6.20e-16 ***
## WORKSITE_STATEPA                4.739e-01  4.741e-02   9.995  < 2e-16 ***
## WORKSITE_STATERI                6.305e-01  7.083e-02   8.902  < 2e-16 ***
## WORKSITE_STATESC               -3.883e-03  6.019e-02  -0.065 0.948555    
## WORKSITE_STATETN                1.681e-01  5.297e-02   3.172 0.001512 ** 
## WORKSITE_STATETX                3.388e-01  4.523e-02   7.490 6.88e-14 ***
## WORKSITE_STATEUT                1.450e-02  6.055e-02   0.239 0.810805    
## WORKSITE_STATEVA                2.791e-01  4.758e-02   5.865 4.49e-09 ***
## WORKSITE_STATEWA                5.598e-01  4.872e-02  11.491  < 2e-16 ***
## WORKSITE_STATEWI                5.806e-01  5.418e-02  10.716  < 2e-16 ***
## longitude                       3.070e-03  2.364e-04  12.987  < 2e-16 ***
## latitude                        7.164e-03  8.535e-04   8.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 747417  on 539147  degrees of freedom
## Residual deviance: 648108  on 539085  degrees of freedom
## AIC: 648234
## 
## Number of Fisher Scoring iterations: 4

From the output above, we can see that the residual variance is much smaller than the null deviance, which suggests that the predictive power of this model is still much better than that of a null model that only includes the intercept. While in terms of the predictors, PREVAILING_WAGE instead of TOTAL_WORKERS is rendered statistically insignificant in this case. I will try rebuilding the model without PREVAILING_WAGE and see if it affects the model’s predictive power significantly:

h1b_glm_mn2 = update(h1b_glm_mn,~.-(PREVAILING_WAGE))
anova(h1b_glm_mn,h1b_glm_mn2,test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + TOTAL_WORKERS + 
##     FULL_TIME_POSITION + PREVAILING_WAGE + EMP_LEN + WORKSITE_STATE + 
##     longitude + latitude
## Model 2: as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + TOTAL_WORKERS + 
##     FULL_TIME_POSITION + EMP_LEN + WORKSITE_STATE + longitude + 
##     latitude
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1    539085     648108                       
## 2    539086     648112 -1   -4.133  0.04205 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual deviance only increases by 4, however, the p-value is smaller than 0.05, indicating that the difference in model fit is statistically significant, and thus, I will keep the original model.

Evaluating the performance of the model fitted

# generate predictions
pred_glmn = predict(h1b_glm_mn,newdata = h1b_glmb.test,type="response")

t = h1b_b %>% 
        group_by(APL_YEAR) %>%
        summarise(cert_no = sum(CASE_STATUS==1),
                  total_no = n(),
                  cert_rate = cert_no/total_no)
mean(t$cert_rate)

## [1] 0.5093508

Based on the average certification rate of the dataset, I choose “0.52” to be our certification cutoff value:

# generate 0 or 1 based on the predicted probability
pred_glmn_c = ifelse(pred_glmn>=0.52,1,0)

confusion matrix

# generate confusion matrix
glm_cfn = confusionMatrix(pred_glmn_c,h1b_glmb.test$CASE_STATUS, positive = "1")
glm_cfn

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 79560 37067
##          1 35954 78483
##                                           
##                Accuracy : 0.684           
##                  95% CI : (0.6821, 0.6859)
##     No Information Rate : 0.5001          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.368           
##  Mcnemar's Test P-Value : 3.87e-05        
##                                           
##             Sensitivity : 0.6792          
##             Specificity : 0.6887          
##          Pos Pred Value : 0.6858          
##          Neg Pred Value : 0.6822          
##              Prevalence : 0.5001          
##          Detection Rate : 0.3397          
##    Detection Prevalence : 0.4953          
##       Balanced Accuracy : 0.6840          
##                                           
##        'Positive' Class : 1               
##

According to the confusion matrix output table, the accuracy of the new logistic regression model is 68.4%, which is less than that of the original model. However, if we look at the no information rate, which is only 50%, so I decide that this model indeed provides useful insight and actually possesses more predictive power in reality. Moreover, the model achieves a relatively fair performance in correctly identifying 69% of the denied cases and 68% of the certified cases, which are much higher than the original model’s rates.

Receiving Operator Characteristic (ROC) Curves

Let’s also plot the ROC curves to further assess the performance of the new logistic regression model:

# ROC curve
pred_glm_rn = prediction(pred_glmn,h1b_glmb.test$CASE_STATUS)
glm_rocn = performance(pred_glm_rn,"tpr","fpr")
plot(glm_rocn)

# area under the curve
glm.aucn = as.numeric(performance(pred_glm_rn,"auc")@y.values)
glm.aucn

## [1] 0.7412834

The accuracy is 74.1% in this case, which suggests that the model is performing fairly good in this case.

Next I decide to try fitting a new decision tree model to the data to see if the performance will also be improved.

Decision Tree (NEW)

Before fitting the model, let’s first split the dataset into training and testing set

# create training and testing dataset
smp_size = floor(0.7*nrow(h1b_b))
set.seed(1996)
train_id = sample(seq_len(nrow(h1b_b)), size = smp_size)
h1b_dtn.train = h1b_b[train_id,]
h1b_dtn.test = h1b_b[-train_id,]

Modeling:

# create formula:
f2 = as.factor(CASE_STATUS) ~ APL_YEAR + OCCUPATION + 
            TOTAL_WORKERS + FULL_TIME_POSITION + PREVAILING_WAGE + EMP_LEN + WORKSITE_STATE + longitude + latitude

# model
h1b_dt_mn = ctree(f2, data=h1b_dtn.train)

Testing the performance of the model

Now let’s run the decision tree model on our testing dataset

# generate predictions
pred_dtn = predict(h1b_dt_mn, h1b_dtn.test)

# generate confusion matrix
dt_cfn = confusionMatrix(pred_dtn,h1b_dtn.test$CASE_STATUS,positive = "1")
dt_cfn

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 92774 27659
##          1 22740 87891
##                                           
##                Accuracy : 0.7819          
##                  95% CI : (0.7802, 0.7836)
##     No Information Rate : 0.5001          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5638          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7606          
##             Specificity : 0.8031          
##          Pos Pred Value : 0.7945          
##          Neg Pred Value : 0.7703          
##              Prevalence : 0.5001          
##          Detection Rate : 0.3804          
##    Detection Prevalence : 0.4788          
##       Balanced Accuracy : 0.7819          
##                                           
##        'Positive' Class : 1               
##

The accuracy of the new decision tree model also decreases to only 80%, and yet the no information rate is only 50% in this case. This suggests that the new decision tree model possesses more predictive power in reality and also outperforms the new logistic regression model, whose accuracy is only 74%. Moreover, the model also achieves a high specificity rate of almost 80% and a pretty good sensitivity rate of 76%, which indicates that the model can identify both certified and denied applications effectively.

Receiving Operator Characteristic (ROC) Curves

Let’s also plot the ROC curves to further assess the performance of the new decision tree model:

# generate probability
dt_probn = 1 - unlist(treeresponse(h1b_dt_mn,h1b_dtn.test), 
                     use.names=F)[seq(1,nrow(h1b_dtn.test)*2,2)]
# ROC curve
pred_dt_rn = prediction(dt_probn,h1b_dtn.test$CASE_STATUS)
dt_rocn = performance(pred_dt_rn,"tpr","fpr")
plot(dt_rocn)

# area under the curve
dt.aucn = as.numeric(performance(pred_dt_rn,"auc")@y.values)
dt.aucn

## [1] 0.8642801

The ROC curve also corroborates the previous finding that the new decision tree model outperforms both the old decision tree model and the new logistic regression model. The high AUC of 0.864 also indicates the the performance of this new model is very good.

Model Comparison

From the plot above, we can easily tell that the new decision model achieves a much better performance in terms of predictions and classifications than the other three models. The area under its associated roc curve is also much higher than those of the other three curves.

Variable Importance

To determine the important criteria of a successful LCA application, I will use the new logistic regression model because its performance of classification and identification is fairly good and it is easier to work with when looking at each predictor separately.

summary(h1b_glm_mn)

## 
## Call:
## glm(formula = f1, family = binomial(link = logit), data = h1b_glmb.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0062  -1.0004  -0.3285   0.9599   2.6393  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -3.355e+00  6.421e-02 -52.248  < 2e-16 ***
## APL_YEAR2012                    5.818e-01  1.143e-02  50.910  < 2e-16 ***
## APL_YEAR2013                    9.448e-01  1.115e-02  84.700  < 2e-16 ***
## APL_YEAR2014                    1.102e+00  1.096e-02 100.488  < 2e-16 ***
## APL_YEAR2015                    1.375e+00  1.095e-02 125.524  < 2e-16 ***
## APL_YEAR2016                    1.563e+00  1.114e-02 140.355  < 2e-16 ***
## APL_YEAR2017                    1.638e+00  1.188e-02 137.854  < 2e-16 ***
## OCCUPATIONANALYST               4.716e-01  2.087e-02  22.595  < 2e-16 ***
## OCCUPATIONCOMPUTER OCCUPATIONS  1.175e+00  1.893e-02  62.097  < 2e-16 ***
## OCCUPATIONDEVELOPER             9.193e-01  1.968e-02  46.711  < 2e-16 ***
## OCCUPATIONENGINEER              3.871e-01  2.140e-02  18.093  < 2e-16 ***
## OCCUPATIONEXECUTIVE            -9.208e-01  6.346e-02 -14.511  < 2e-16 ***
## OCCUPATIONMANAGER              -2.068e-01  2.284e-02  -9.052  < 2e-16 ***
## OCCUPATIONOTHER                 3.361e-02  1.927e-02   1.744 0.081174 .  
## OCCUPATIONTEACHER               1.918e-01  2.422e-02   7.916 2.45e-15 ***
## OCCUPATIONTECHNICIAN           -2.005e-01  3.955e-02  -5.069 4.00e-07 ***
## TOTAL_WORKERS                  -1.369e-03  2.447e-04  -5.595 2.20e-08 ***
## FULL_TIME_POSITIONY             6.467e-01  1.368e-02  47.259  < 2e-16 ***
## PREVAILING_WAGE                -3.565e-08  1.893e-08  -1.883 0.059668 .  
## EMP_LEN                         3.005e-01  5.542e-03  54.219  < 2e-16 ***
## WORKSITE_STATEAR                6.007e-01  6.477e-02   9.275  < 2e-16 ***
## WORKSITE_STATEAZ                5.166e-01  5.157e-02  10.017  < 2e-16 ***
## WORKSITE_STATECA                3.071e-01  4.524e-02   6.789 1.13e-11 ***
## WORKSITE_STATECO                1.810e-01  5.264e-02   3.439 0.000584 ***
## WORKSITE_STATECT                6.376e-01  5.301e-02  12.028  < 2e-16 ***
## WORKSITE_STATEDC               -1.194e-01  5.370e-02  -2.223 0.026188 *  
## WORKSITE_STATEDE                5.315e-01  6.081e-02   8.741  < 2e-16 ***
## WORKSITE_STATEFL                1.585e-02  4.658e-02   0.340 0.733680    
## WORKSITE_STATEGA                2.431e-01  4.709e-02   5.162 2.44e-07 ***
## WORKSITE_STATEIA                5.318e-01  6.188e-02   8.594  < 2e-16 ***
## WORKSITE_STATEID                1.086e-01  8.406e-02   1.292 0.196510    
## WORKSITE_STATEIL                4.201e-01  4.652e-02   9.030  < 2e-16 ***
## WORKSITE_STATEIN                4.570e-01  5.328e-02   8.577  < 2e-16 ***
## WORKSITE_STATEKS                2.291e-01  6.076e-02   3.771 0.000163 ***
## WORKSITE_STATEKY                4.121e-01  6.277e-02   6.565 5.19e-11 ***
## WORKSITE_STATELA               -7.602e-03  6.118e-02  -0.124 0.901110    
## WORKSITE_STATEMA                3.315e-01  4.760e-02   6.965 3.28e-12 ***
## WORKSITE_STATEMD                2.158e-01  4.898e-02   4.405 1.06e-05 ***
## WORKSITE_STATEMI                4.018e-01  4.830e-02   8.318  < 2e-16 ***
## WORKSITE_STATEMN                2.324e-01  5.052e-02   4.600 4.22e-06 ***
## WORKSITE_STATEMO                5.604e-01  5.326e-02  10.521  < 2e-16 ***
## WORKSITE_STATEMS                4.884e-02  8.244e-02   0.592 0.553623    
## WORKSITE_STATENC                4.035e-01  4.816e-02   8.378  < 2e-16 ***
## WORKSITE_STATENE               -2.440e-02  6.754e-02  -0.361 0.717945    
## WORKSITE_STATENH                4.930e-01  7.928e-02   6.218 5.03e-10 ***
## WORKSITE_STATENJ                4.912e-01  4.650e-02  10.563  < 2e-16 ***
## WORKSITE_STATENM                1.995e-01  7.501e-02   2.659 0.007835 ** 
## WORKSITE_STATENV               -3.484e-01  6.635e-02  -5.250 1.52e-07 ***
## WORKSITE_STATENY                1.153e-01  4.544e-02   2.539 0.011132 *  
## WORKSITE_STATEOH                3.195e-01  4.839e-02   6.603 4.03e-11 ***
## WORKSITE_STATEOK                2.761e-01  6.421e-02   4.299 1.71e-05 ***
## WORKSITE_STATEOR                4.543e-01  5.619e-02   8.085 6.20e-16 ***
## WORKSITE_STATEPA                4.739e-01  4.741e-02   9.995  < 2e-16 ***
## WORKSITE_STATERI                6.305e-01  7.083e-02   8.902  < 2e-16 ***
## WORKSITE_STATESC               -3.883e-03  6.019e-02  -0.065 0.948555    
## WORKSITE_STATETN                1.681e-01  5.297e-02   3.172 0.001512 ** 
## WORKSITE_STATETX                3.388e-01  4.523e-02   7.490 6.88e-14 ***
## WORKSITE_STATEUT                1.450e-02  6.055e-02   0.239 0.810805    
## WORKSITE_STATEVA                2.791e-01  4.758e-02   5.865 4.49e-09 ***
## WORKSITE_STATEWA                5.598e-01  4.872e-02  11.491  < 2e-16 ***
## WORKSITE_STATEWI                5.806e-01  5.418e-02  10.716  < 2e-16 ***
## longitude                       3.070e-03  2.364e-04  12.987  < 2e-16 ***
## latitude                        7.164e-03  8.535e-04   8.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 747417  on 539147  degrees of freedom
## Residual deviance: 648108  on 539085  degrees of freedom
## AIC: 648234
## 
## Number of Fisher Scoring iterations: 4

APL_YEAR

From the output, I find out that APL_YEAR is positively correlated with the certification probability in each year and the probability has increased over the years with the coefficients increasing from 2011 to 2017. This indicates that an H-1B application filed in 2017 is more likely to get certified than an application filed in 2011 with all other variables held constant.

plot the certification rate in each year

Figure 1 shows the percentage of the total applicants as well as the certification rate in each year. It is obvious that the certification rate increases when the total number of applicants decreases. This is probably why the coefficient of APL_YEAR2017 is the biggest among the other year variables.

OCCUPATION

From the output, I discover that jobs that require more advanced professional and technical skills such as programmer, developer, analyst and engineer are more likely to get certified, on the contrary, jobs that are related to management like manager and executive are less likely to get certified given the negative coefficients. And for the jobs that do not belong to any of these categories, the probability will not change a lot from job to job.

plot the certification rate for each type of occupation

Figure 2 shows the interaction between the percentage of the total applicants and the certification rate for each type of occupation in each year. It is obvious that the average certification rate for computer-science related jobs is the highest in each year despite a decreasing trend. The average certification rate for developer jobs is the second highest and displays an increasing trend. is also increasing increases when the total number of applicants decreases. Jobs like accountant, engineer or analysts also have a fairly good certification rate. While the average certification rates of jobs like technician, manager, teacher and executive are much lower than those of other jobs.

FULL_TIME_POSITION

Not surprisingly, applications for full-time position are much more likely to get certified than those for part-time position.

WORSITE_STATE

The summary result shows some interesting insights of the state of the worksite. Most of the states are positively correlated to the certification probability. Among them, Arkansas, Connecticut, Rhode Island, Wisconcin and Washington show higher certification probabilities than other states given that their relatively larger coefficients. I find it surprising as most of them are not states normally known for openness to immigration such as California. On the other hand, coefficents associated with Nebraska, Nevada, Louisiana, South Carolina and Washington D.C. are negative, which suggest that applying to work in these states might negatively affect the certification probability.

Let’s visualize the data see if it aligns with the result above

Figure 4 shows the interaction between the percentage of the total applicants and the certification rate for each state in each year. According to the plot, I discover that if I take the number of applicants per state into account, California, New York, Georgia and Texas actually have the highest certification rates, which is not surprising given that the most dynamic & diverse metropolitan economies are located in these states (Dallas, Atlanta, New York City and Sanfrancisco). On the other hand, states that associated with large coefficients actually score poorly in terms of certification rate except Washington DC. While, states that are negatively correlated to the certification probability score the most poorly among all the states.

EMP_LEN

Figure 5 shows the interaction between the percentage of the total applicants and the certification rate for each contract length in each year. It is reasonable to conclude that the longer the contract length is, the higher the average certification rate is. However, if the contract length exceeds 3 years, then the certification probability actually gets worse in this case.

PREVAILING_WAGE

Figure 6 shows the relationship between the industry wage level associated with the job and the certification rate in each year. Obviously, the higher the wage level is the higher the probability that a certain applicaiton will get certified.

Conclusion:

In conclusion, decision tree model performs better than logistic regression model does. The decision tree model has higher accuracy of prediction according to the confusion matrix and the area under the ROC curve. Neither of the models possess much predictive power because the original dataset is highly skewed as the number of certified cases is almost 34 times more than that of denied cases. After transforming the dataset using synthetic data generation method, performances and predictive powers of both logistic regression and decision tree model are greatly improved with higher AUC values. Still, the new decision model outperforms the new logistic regression model by a great extent given that it not only has higher accuracy but also higher specificivity and sensitivity. The area under ROC curve also reaches a new high of almost 86% for the new decision tree model. Therefore, I conclude that the decision tree model performs better as a binary classification model.

At the same time, I determine several important characteristics of an LCA application that has high probability of being certified: i. The job associated with the application requires advanced professional and technical skills or is in the field of computer science. ii. A full-time position. iii. The worksite is in one of the states where a dynamic & diverse metropolitan economy is located such as Dallas, Atlanta, New York City and San Francisco. iv. The contract length of the employment is 2~3 years. v. The industry wage level of the associated job is higher than the average wage level.

STOR 320 Final Project

Yuqing Tang

11/19/2017

Overview and Motivation:

Initial Questions:

Data Cleaning and Wrangling:

Exploratory Data Analysis:

Exploratory Geo Plot

Exploratory Trend Plot

Final Question and Approach:

Data Analysis:

Logistic Regression

Modeling

Evaluating the performance of the model fitted

Decision Tree

Modeling:

Testing the performance of the model

Imbalanced Sample Data Correction

Logistic Regression (NEW)

Modeling

Evaluating the performance of the model fitted

Decision Tree (NEW)

Modeling:

Testing the performance of the model

Model Comparison

Variable Importance

Conclusion:

STOR 320 Final Project

Yuqing Tang

11/19/2017

Overview and Motivation:

Initial Questions:

Data Cleaning and Wrangling:

Exploratory Data Analysis:

Exploratory Geo Plot

Exploratory Trend Plot

Final Question and Approach:

Data Analysis:

Logistic Regression

Modeling

Evaluating the performance of the model fitted

Decision Tree

Modeling:

Testing the performance of the model

Imbalanced Sample Data Correction

Logistic Regression (NEW)

Modeling

Evaluating the performance of the model fitted

Decision Tree (NEW)

Modeling:

Testing the performance of the model

Model Comparison

Variable Importance

Conclusion:

Reference and Related Work: