Data Overview
Univariate Plots Section
Univariate Analysis
Bivariate Plots Section
Bivariate Analysis
Multivariate Plots Section
Multivariate Analysis
- Talk about some of the relationships you observed in this part of the
  investigation. Were there features that strengthened each other in terms of
  looking at your feature(s) of interest?
- Were there any interesting or surprising interactions between features?
Final Plots and Summary
Reflection

========================================================

Context: The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States. For a foreign national to apply for H1-B visa, an US employer must offer a job and petition for H-1B visa with the US immigration department. This is the most common visa status applied for and held by international students once they complete college/ higher education (Masters, PhD) and work in a full-time position.

Data Overview

The columns in the dataset include:

CASE_STATUS: Status associated with the last significant event or decision. Valid values include “Certified,” “Certified-Withdrawn,” Denied,” and “Withdrawn”.
EMPLOYER_NAME: Name of employer submitting labor condition application.
SOC_NAME: Occupational name associated with the SOC_CODE. SOC_CODE is the occupational code associated with the job being requested for temporary labor condition, as classified by the Standard Occupational Classification (SOC) System.
JOB_TITLE: Title of the job
FULL_TIME_POSITION: Y = Full Time Position; N = Part Time Position
PREVAILING_WAGE: Prevailing Wage for the job being requested for temporary labor condition. The wage is listed at annual scale in USD. The prevailing wage for a job position is defined as the average wage paid to similarly employed workers in the requested occupation in the area of intended employment. The prevailing wage is based on the employer’s minimum requirements for the position.
YEAR: Year in which the H-1B visa petition was filed
WORKSITE: City and State information of the foreign worker’s intended area of employment
lon: longitude of the Worksite
lat: latitude of the Worksite

# Load the Data
h1b_df <- read.csv("~/Udacity/P4_Explore_Summarize_Data/project4/h1b_kaggle.csv")


# Convert all the lowercase characters to uppercase
h1b <- data.frame(lapply(h1b_df, function(v) {
  if (is.factor(v)) return(toupper(v))
  else return(v)
}))

h1b <- tbl_df(h1b)
glimpse(h1b)

## Observations: 3,002,458
## Variables: 11
## $ X                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ CASE_STATUS        <fctr> CERTIFIED-WITHDRAWN, CERTIFIED-WITHDRAWN, ...
## $ EMPLOYER_NAME      <fctr> UNIVERSITY OF MICHIGAN, GOODMAN NETWORKS, ...
## $ SOC_NAME           <fctr> BIOCHEMISTS AND BIOPHYSICISTS, CHIEF EXECU...
## $ JOB_TITLE          <fctr> POSTDOCTORAL RESEARCH FELLOW, CHIEF OPERAT...
## $ FULL_TIME_POSITION <fctr> N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ PREVAILING_WAGE    <dbl> 36067.0, 242674.0, 193066.0, 220314.0, 1575...
## $ YEAR               <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ WORKSITE           <fctr> ANN ARBOR, MICHIGAN, PLANO, TEXAS, JERSEY ...
## $ lon                <dbl> -83.74304, -96.69889, -74.07764, -104.99025...
## $ lat                <dbl> 42.28083, 33.01984, 40.72816, 39.73924, 38....

The WORKSITE variable contains both city and state information, so it’s more convenient for me to do more granular analysis if I split WORKSITE into STATE and CITY.

# Split WORKSITE column into two columns: city and state
h1b <- h1b %>%
    separate(WORKSITE, c("CITY", "STATE"), ", ")

# Change variables 'STATE' and 'YEAR' to ordered factors
h1b$STATE <- factor(h1b$STATE, ordered = TRUE)
h1b$YEAR <- factor(h1b$YEAR, ordered = TRUE)

glimpse(h1b)

## Observations: 3,002,458
## Variables: 12
## $ X                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ CASE_STATUS        <fctr> CERTIFIED-WITHDRAWN, CERTIFIED-WITHDRAWN, ...
## $ EMPLOYER_NAME      <fctr> UNIVERSITY OF MICHIGAN, GOODMAN NETWORKS, ...
## $ SOC_NAME           <fctr> BIOCHEMISTS AND BIOPHYSICISTS, CHIEF EXECU...
## $ JOB_TITLE          <fctr> POSTDOCTORAL RESEARCH FELLOW, CHIEF OPERAT...
## $ FULL_TIME_POSITION <fctr> N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ PREVAILING_WAGE    <dbl> 36067.0, 242674.0, 193066.0, 220314.0, 1575...
## $ YEAR               <ord> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ CITY               <chr> "ANN ARBOR", "PLANO", "JERSEY CITY", "DENVE...
## $ STATE              <ord> MICHIGAN, TEXAS, NEW JERSEY, COLORADO, MISS...
## $ lon                <dbl> -83.74304, -96.69889, -74.07764, -104.99025...
## $ lat                <dbl> 42.28083, 33.01984, 40.72816, 39.73924, 38....

Univariate Plots Section

H1B Visa Case Status

# Create a bar chart to show the number of H1B cases in different case statuses
ggplot(subset(h1b, !is.na(CASE_STATUS)), 
       aes(x = CASE_STATUS, fill = CASE_STATUS)) +
    geom_bar(stat = "count") +
    coord_trans(y = "sqrt") +
    ylab("No. of Applications") +
    theme(legend.position = c(0.9, 0.9), 
          legend.title = element_text("Case Status"), 
          legend.key.size = unit(0.3, "cm"),
          axis.text.x=element_text(angle = -10, hjust = 0, size = rel(1))) +
    ggtitle("Distribution of H1B visa case status")

After plotting out the distribution of case status, we can easily see that certified cases dominate in this dataset. The bar chart above shows us the distribution of H1B visa status, a vast majority of the case status is “Certified” in this dataset. So my further analysis will only depend on CERTIFIED cases, which will provide more accurate insights to this scenario.

Popular H1B Visa Sponsors

# Keep only "CERTIFIED" H1B cases
certified_h1b <- h1b %>%
    filter(CASE_STATUS == "CERTIFIED")



# Function to return the top N employers that have the most H1B workers
top_N_employers <- function(num_emp) {
    certified_h1b %>%
        group_by(EMPLOYER_NAME) %>%
        summarise(num_apps = n()) %>%
        arrange(desc(num_apps)) %>%
        slice(1:num_emp)
}


# Bar plot to show the top 10 employers who filed the most h1b visa applications
ggplot(top_N_employers(10), 
       aes(x = reorder(EMPLOYER_NAME, num_apps), y = num_apps)) +
    geom_bar(stat = "identity", alpha = 0.9, fill = "sky blue", width = 0.7) +
    coord_flip() +
    scale_y_continuous(limits = c(0, 140000), breaks = seq(0, 140000, 20000)) +
    geom_text(aes(label = num_apps), hjust = -0.2, size = 2) +
    ggtitle("Top 10 Employers with most applications") +
    theme(plot.title = element_text(size = rel(1)),
          axis.text.y = element_text(size = rel(0.8))) +
    labs(x = "Employer Name", y = "No. of Applications")

The bar chart tells us that Infosys Limited filed more than twice as many H1B visa applications as Tata Consultancy did within the six-year period. The top 3 companies in this chart are all India information technology companies.

Popular H1B Occupations

# Function to return top N occupations that have the most H1B applicants
top_N_SOC <- function(num) {
    certified_h1b %>%
        filter(!is.na(certified_h1b$SOC_NAME)) %>%
        group_by(SOC_NAME) %>%
        summarise(num_apps = n()) %>%
        arrange(desc(num_apps)) %>%
        slice(1:num)
}

# Bar plot to show the top 10 H1B occupations 
ggplot(top_N_SOC(10), 
       aes(x = reorder(SOC_NAME, num_apps), y = num_apps)) +
    geom_bar(stat = "identity", alpha = 0.9, fill = "orange", width = 0.7) +
    coord_flip() +
    scale_y_continuous() +
    geom_text(aes(label = num_apps), hjust = -0.2, size = 2) +
    ggtitle("Top 10 occupations with most H1B petitions") +
    theme(plot.title = element_text(size = rel(1)),
          axis.text.y = element_text(size = rel(0.8))) +
    labs(x = "SOC Name", y = "No. of Applications")

The most popular occupations are computer system analysts, software developers, applications, computer programmers, computer occupations, all other, software developers, systems software, management analysts, financial analysts, accountants and auditors, mechanical engineers, network and computer systems administrators.

Obviously, tech occupations far outnumbered non-tech occupations. Among tech occupations, computer systems analyst is the most popular occupation for H1B applicants.

For more detailed information about each of these SOC names, please look them up via this link.

Full-time jobs v.s. Non full-time jobs

# Summarize full-time and non full-time positions
full_time_pos <- certified_h1b %>%
    filter(!is.na(FULL_TIME_POSITION)) %>%
    group_by(FULL_TIME_POSITION) %>%
    summarise(n = n()) %>%
    mutate(rel_freq = paste0(round(100 * n/sum(n), 2), "%"),
           pos = cumsum(n) - n/2)

# Pie chart showing proportion of full-time and non full-time jobs
ggplot(full_time_pos, 
       aes(x = factor(1), y = n, fill = factor(FULL_TIME_POSITION, levels = c("Y", "N")))) + 
    geom_bar(width = 1, stat = "identity", color = "grey") +
    labs(fill = "Full time position?") +
    coord_polar(theta = "y") +
    geom_text(aes(x = factor(1), y = pos, label = rel_freq), size=5) +
    theme_void() +
    theme(legend.position = "bottom", legend.text = element_text(size = 14),
          plot.title = element_text(size = rel(1))) +
    ggtitle("Full time positions vs non full time positions")

Next, let’s take a look at the percentages of full-time jobs and non full-time jobs. 85.8% of the total H1B visa applicants have full-time jobs, while the remaining 14.2% filed H1B visa petitions based on their part-time jobs.

Distribution of Prevailing Wage

# Generate a sample of wages and filter out NA values
set.seed(123)
wage_sample <- subset(certified_h1b[sample(1:nrow(certified_h1b), 300000), 7], 
                      !is.na(PREVAILING_WAGE))

# Histogram of sampled wages
ggplot(subset(wage_sample, 
              PREVAILING_WAGE >= quantile(PREVAILING_WAGE, 0.1) & 
                  PREVAILING_WAGE <= quantile(PREVAILING_WAGE, 0.95)), 
              aes(x = PREVAILING_WAGE)) + 
    geom_histogram(binwidth = 1000,
                   col = "red", 
                   fill = "blue", 
                   alpha = .2) +
    scale_x_continuous(breaks = seq(min(wage_sample$PREVAILING_WAGE), 
                                    max(wage_sample$PREVAILING_WAGE),
                                    5000)) +
    labs(title = "Histogram of Prevailing Wage)", 
         x = "prevailing wage")

summary(certified_h1b$PREVAILING_WAGE)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     10504     54766     65125     72554     81432 306049120

To investigate the prevailing_wage variable, histogram would be the best visualization to see the distribution. However, there are more than 3 million records in the dataset with a lot of extreme values. The alternative way to show the histogram of wage is to randomly sample about one tenth of the records and exclude the bottom 10 percent and top 5 percent data points from the sampled dataset.

Now I get the perfect histogram of prevailing wages. This right skewed distribution tells us that most foreign workers’ wages are between $60,000 and $65,000. The right tail of the distribution shows us that there are fewer foreign workers as the wages increase.

But the biggest flaw in this histogram is that I didn’t adjust the wage for inflation. This chart include all the data from 2011 to 2016.

H1B petitions by state

# Import state data containing state abbreviations and locations
data(state)
states <- data.frame(state.abb, state.name, state.area, state.center,
                     state.division, state.region, state.x77)

row.names(states) <- NULL
states <- states %>%
    select(state.abb, state.name, x, y, Population, Income, Illiteracy)
states <- subset(states, ! state.abb %in% c("HI", "AK"))


# Count H1B petitions filed by each state
petition_by_state <- certified_h1b %>%
    filter(STATE != "NA") %>%
    group_by(region = tolower(STATE)) %>%
    summarise(no_petitions = n()) %>%
    arrange(desc(no_petitions))


# Draw the map
us_states <- map_data("state")

petition_by_state <- inner_join(petition_by_state, us_states, by = "region")

plot_petition_by_state <- ggplot(petition_by_state, aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = cut_number(no_petitions, 7, dig.lab=6))) +
    geom_path(color = "gray", linestyle = 2) +
    coord_map() +
    geom_text(data = states, aes(x = x, y = y, label = state.abb, group = NULL),
              size = 4, color = "white") +
    scale_fill_brewer('Number of petitions', palette  = 'PuRd', label = scales::comma) +
    ggtitle("H1B petitions by state") +
    theme_bw()
    
plot_petition_by_state

California has the most H1B visa petitions during the past five years, followed by Texas and New York state. Later we will dive in deeper to find out the H1B visa petitions by cities.

Univariate Analysis

What is the structure of your dataset?

There are 3,002,458 records in the dataset with 11 features (case_status, employer_name, soc_name, job_title, full_time_position, prevailing_wage, year, worksite, longitude, latitude).

There are some interesting observations derived from the univariate analysis:

The case status of the majority in this dataset is “Certified”.
Top 3 employers who filed the most H1B visa for their foreign employees are all Indian IT companies.
The median previaling wage for a H1B worker is $65,020 and the max price is $6,998,000,000. The interquantile range of prevailing wage is $27,061.
California, New York and Texas filed the most H1B visas from 2011 to 2016.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the dataset are prevailing wage and quantity of H1B petitions. I would like to see how the prevailing wage varies accross different employers and occupations. I wonder whether some employers tend to give a more generous salary package to H1B workers than others, whether particular occupations receive a higher salary than others. I’m also curious about changes in wages and quantity of H1B petitions from 2011 to 2016.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

SOC_NAMES, and EMPLOYERS. I would like to see how the prevailing wage varies accross different employers and occupations. I wonder whether some employers tend to give a more generous salary package to H1B workers than others, whether particular occupations receive a higher salary than others.

STATE, longitude and latitude are likely to contribute to the different wage levels and quantity of H1B petitions. My question is whether H1B workers living in areas with high living expenses tend to get higher wages correspondingly compared to those in other areas, and whether some economically developed areas are more willing to hire more foreign workers.

Did you create any new variables from existing variables in the dataset?

Yes, I split the WORKSITE variable into two new variables, CITY and STATE. With this data transformation, I can easily conduct more granular analysis based on locations.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Prevailing Wage:

When I investigated the distribution of prevailing wage, I find that there are many extreme values that distort the shape of the histogram. So I digged into the data deeper and noticed that the minimum wage is $0 and the maximum wage reaches 6 billion dollars! The anomalies in the dataset severely affect my analysis, so I decide to discard these extreme values and only focus on analyzing the middle range of the wage data.

Case Status:

At first glance, it’s hard for me to believe that there are so many H1B visa petitions during the past six years because a total of 85,000 cap subject H1B visas are available and can be issued each year. Of the 85,000 cap subject visas, 65,000 are available for the Regular Cap, while 20,000 are available for the ADE (Advanced Degree Exemption) Cap.

After an extensive research, I found the answer on the Kaggle discussion forum: the data contains New H1B petitions(before the lottery) + Extension Petitions + Positions exempt from H-1B visa cap ( PHD, Researchers ). For the CASE_STATUS, “CERTIFIED”" does not mean the applicant got his/her H1B visa approved, it just means that he/she is eligible to file an H1B.

Bivariate Plots Section

Trend of H1B petitions over years

# Count H1B petitions filed in each year
case_quantity_per_year <- certified_h1b %>%
    group_by(YEAR) %>%
    summarise(all_occupations = n())

# Bar plot showing the H1B quantities in each year
ggplot(case_quantity_per_year, aes(y = all_occupations, x = YEAR, fill = YEAR)) + 
      geom_bar(stat = "identity", alpha = 0.7, width = 0.5) + 
      scale_y_continuous(limits = c(0, 570000), 
                         breaks = seq(0, 570000, 100000),
                         labels = scales::comma) + 
      ggtitle("Changes in quantity of H1B cases over five years")+
      theme(
            plot.title = element_text(size = rel(1.3)),
            panel.background = element_rect(fill = '#f0f0f0'),
            legend.position = "none"
            )

Comparison of H1B petitions between California, New York and Texas

# Count H1B petitions filed in CA, NY and TX in each year
cnt_case_per_year <- certified_h1b %>%
    filter(STATE %in% c("CALIFORNIA", "NEW YORK", "TEXAS")) %>%
    group_by(YEAR, STATE) %>%
    summarise(count = n()) %>%
    arrange(YEAR, STATE) 

# Bar plot showing H1B quantities of each state in each year
ggplot(cnt_case_per_year, aes(x = YEAR, y = count, fill = STATE)) +
    geom_bar(stat = "identity", position = position_dodge(), alpha = 0.8,
             color = "grey") +
    ggtitle("Quantity of H1B cases in California, New York and Texas from 2011 to 2016") +
    theme(legend.position = "bottom",
         plot.title = element_text(size = rel(1.3)))

# Display the table with specific numbers
cnt_case_per_year <- cnt_case_per_year %>%
    spread(key = STATE, value = count)

knitr::kable(cnt_case_per_year)

YEAR	CALIFORNIA	NEW YORK	TEXAS
2011	56252	35244	26851
2012	64537	37086	31841
2013	72171	36460	36408
2014	85164	42169	45091
2015	100710	47703	55066
2016	104070	51293	59694

Prevailing wages of top 10 employers

# Top 10 employers who filed the most H1B petitions
top_10_employers <- certified_h1b %>%
        group_by(EMPLOYER_NAME) %>%
        summarise(num_apps = n()) %>%
        arrange(desc(num_apps)) %>%
        slice(1:10) %>%
        select(EMPLOYER_NAME)

employers_boxplot_df <- certified_h1b %>%
    filter(EMPLOYER_NAME %in% top_10_employers$EMPLOYER_NAME)

# Boxplot showing the wage distribution of each employer
ggplot(employers_boxplot_df, aes(y = PREVAILING_WAGE, x = EMPLOYER_NAME, 
                                 fill = EMPLOYER_NAME, notch = TRUE, notchwidth = .3)) + 
      geom_boxplot(notch = TRUE) + 
      scale_y_continuous(limits = c(0, 150000), 
                         breaks = seq(0, 150000, 10000)) + 
      ggtitle("Wages for H1B cases in top 10 companies")+
      theme(
            plot.title = element_text(size = rel(1.3)),
            panel.background = element_rect(fill = '#f0f0f0'),
            axis.text.x=element_blank(),
            legend.position = "bottom", 
            legend.title = element_text(size = rel(0.7)),
            legend.text = element_text(size = rel(0.4)), 
            panel.grid.major = element_line(colour = '#f0f0f0'),
            panel.grid.major.x = element_line(linetype = 'blank'),
            panel.grid.minor = element_line(linetype = 'blank')  
      )

From the boxplot we can see that the median wage of Microsoft exceed all the other major sponsor companies given the wage range between $0 and $15,000. The interquantile range of prevailing wages of Tata Consultancy is the smallest compared to that of other companies, in other words, Tata Consultancy has the least variation in wages for the middle 50% of H1B workers.

Top 10 occupations with highest median prevailing wages

# Top 10 occupations with the highest wages
top_10_soc_with_highest_wage <- certified_h1b %>%
        group_by(SOC_NAME) %>%
        summarise(median_wage = median(PREVAILING_WAGE)) %>%
        arrange(desc(median_wage)) %>%
        slice(1:10) %>%
        select(SOC_NAME, median_wage)

# Bar plot showing median wages for each occupation
ggplot(top_10_soc_with_highest_wage, aes(y = median_wage, x = reorder(SOC_NAME, median_wage))) + 
      geom_bar(stat = "identity", fill = "blue", alpha = 0.7, width = 0.7) + 
      #scale_y_continuous(limits = c(0, 150000), 
                         #breaks = seq(0, 150000, 5000)) + 
      ggtitle("Top 10 occupations with highest median prevailing wages") +
      coord_flip() +
      theme(
            plot.title = element_text(size = rel(1)),
            axis.text.x=element_text(size = rel(0.8)),
            legend.position = "bottom"
      ) +
      labs(x = "Occupational Name")

knitr::kable(top_10_soc_with_highest_wage)

SOC_NAME	median_wage
PHYSICIANS AND SURGEONS, ALL OTHTER	230605.0
UROLOGISTS	213158.0
STRUCTURAL METAL FABRICATORS AND FITTERS	204090.0
SECURITIES, COMMODITIES, AND FINANCIAL SERVICES S	201510.0
FAMILY & GENERAL PRACTITIONERS	194188.8
PEDIATRICIANS	189731.0
INTERNISTS	187200.0
DENTIST	182874.0
PHYSICIAN AND SURGEONS, ALL OTHER	178235.2
PHYSICIANS AND SURGEONS	174179.0

Because the dataset has a lot of outliers and is severely skewed, using the median wage as the metric to compare prevailing wages of different occupations will help reduce distortion and provide a better picture. Based on the median wage, the occupation having the highest median wage is PHYSICIANS AND SURGEONS. Out of the top 10 high income occupations, 8 are in the medical and health care field. I also find out that none of the top 10 high income occupations is included in the top 10 occupations with the most H1B petitions (computer system analysts, software developers, etc.).

Certified Cases v.s. Denied Cases

certified_denied_h1b <- h1b %>%
    filter(CASE_STATUS == "CERTIFIED" | CASE_STATUS == "DENIED")

# Boxplot of prevailing wage in these two categories
ggplot(certified_denied_h1b, aes(y = PREVAILING_WAGE, x = CASE_STATUS, 
                                 fill = CASE_STATUS, notch = TRUE, 
                                 notchwidth = .3)) + 
      geom_boxplot(notch = TRUE) + 
      scale_fill_manual(values = c("#29a329", "#ea4b1f"))+
      scale_y_continuous(limits = c(0, 150000), 
                         breaks = seq(0, 150000, 10000)) + 
      ggtitle("Wages for certified & denied H1B cases")+
      theme(
            plot.title = element_text(size = rel(1.3)),
            panel.background = element_rect(fill = 'light gray'),
            panel.grid.major = element_line(colour = '#f0f0f0'),
            panel.grid.major.x = element_line(linetype = 'blank'),
            panel.grid.minor = element_line(linetype = 'blank')
      )

In general, the prevailing wages of denied H1B cases have more extreme values than those of certified H1B cases. The 1st quartile, median and 3rd quartile of prevailing wages for certified cases are greater than those of denied cases respectively.

Wage distribution for each year

# Randomly sample one tenth of total H1B petitions, excluding Null values in wages
wage_year_sample <- subset(certified_h1b[sample(1:nrow(certified_h1b), 300000), c(7, 8)], 
                      !is.na(PREVAILING_WAGE) & 
                          PREVAILING_WAGE <= quantile(certified_h1b$PREVAILING_WAGE, 0.99))

# Histograms of wages in each year
wage_dist_per_year <- ggplot(subset(wage_year_sample, 
              PREVAILING_WAGE >= quantile(PREVAILING_WAGE, 0.1) & 
                  PREVAILING_WAGE <= quantile(PREVAILING_WAGE, 0.95)), 
              aes(x = PREVAILING_WAGE)) + 
    geom_histogram(binwidth = 1000,
                   col = "red", 
                   fill = "blue", 
                   alpha = .2) +
    facet_wrap(~YEAR) +
    scale_x_continuous(breaks = seq(min(wage_sample$PREVAILING_WAGE), 
                                    max(wage_sample$PREVAILING_WAGE),
                                    10000)) +
    labs(title = "Histograms of Prevailing Wages by year", 
         x = "prevailing wage") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

wage_dist_per_year

For each year, the bulk of H1B applicants have salaries between $50,000 and $70,000. The distribution of wage for each year is right skewed.

Trend of wages from 2011 to 2016

set.seed(123)

# Filter out the top 1% outliers in the prevailing_wage variable
wage_year_sample <- subset(certified_h1b[sample(1:nrow(certified_h1b), 300000), 
                                         c(7, 8)], 
                      !is.na(PREVAILING_WAGE) & 
                          PREVAILING_WAGE <= quantile(certified_h1b$PREVAILING_WAGE, 
                                                      0.99))

wage_year_sample <- wage_year_sample %>%
    group_by(YEAR) %>%
    mutate(mean_wage = mean(PREVAILING_WAGE),
           median_wage = median(PREVAILING_WAGE),
           '10th_percentile' = quantile(PREVAILING_WAGE, 0.1),
           '90th_percentile' = quantile(PREVAILING_WAGE, 0.9))

wage_year_stats <- wage_year_sample %>%
    distinct(mean_wage, median_wage, `10th_percentile`, `90th_percentile`)
    
wage_year_stats <- wage_year_stats[order(wage_year_stats$YEAR), 
                                   c(ncol(wage_year_stats), 
                                     1:(ncol(wage_year_stats) - 1))]
    
# From wide to long format
wage_year_stats_long <- gather(wage_year_stats, statistics, values, 
                               mean_wage:`90th_percentile`,
                               factor_key = TRUE)

# Trend of median, mean, 10th percentile and 90th percentile of wages
wage_trend <- ggplot(wage_year_stats_long, 
                     aes(x = YEAR, y = values, group = statistics)) +
    geom_line(aes(color = statistics), lineend = "round", size = 1) +
    expand_limits(y = 0) +
    scale_y_continuous(breaks = seq(0, 120000, 10000), labels = scales::comma) +
    ggtitle("Wage Trend from 2011 to 2016: Line Chart") +
    labs(y = "wage / $") +
    theme(plot.title = element_text(size = rel(1.3)),
          legend.position = "bottom")
    
wage_trend

knitr::kable(wage_year_stats)

YEAR	mean_wage	median_wage	10th_percentile	90th_percentile
2011	65088.39	61173	40123.0	94952
2012	66658.97	62546	42099.0	96138
2013	68426.64	63898	44616.0	98342
2014	69405.36	64688	45406.0	98675
2015	70561.24	66019	47247.2	99939
2016	72778.78	68141	48318.0	104118

This plot depicts the mean, median, 10th percentile and 90th percentile of wages for all the H1B workers in each year. From 2011 to 2016, wages for H1B applicants are increasing gradually at a reltively low rate. The mean wages being larger than the median wages every year indicates that the wage distribution is right skewed and these outliers are so extreme that they drag the mean wages up.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Considering the payments in the five years period, average payments for foreign workers with certified H1B visa are higher than those with denied H1B. Some denied cases even have payments of $0.
Microsoft, one of the biggest tech companies, has the highest average wage for employees with certified H1B. Big consulting firms such as Accenture and Deloitte also tend to hire many foreign workers and offer them good salaries.
Despite the fact that certified H1B applicants in the medical science and health care field account for only a small fraction of the total H1B applicants, these people earn much more than people in other fields. Physicians and surgeons have an average annual salary above $200,000.
When investigating the trend of wages over the past five years, I found that the average and median wages for H1B workers are going up steadily.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

California, New York and Texas are top 3 states that filed most H1B visa applications, so I decide to compare the H1B case quantity of each state per given year. From the bar plot above, it’s clear to see that California outnumbered the other two in each year. In 2011 and 2012, Texas has less certified H1B workers than New York state, but since 2013 Texas started to catch up and exceeded New York in 2014, 2015 and 2016. The number of certified H1B workers has been increasing over the past five years in all the three states.

What was the strongest relationship you found?

Year and quantity of H1B petitions are positively correlated, the same situation applies to year and prevailing wage. The US economy continues growing steadily over the past six year, bringing more job opportunities to the US job market as well as widespread wage increase.

Multivariate Plots Section

Distribution of H1B applicants in California, Texas and New York

California, Texas and New York are top 3 states that has the most H1B applicants over the past six years. I plotted the locations using longtitude and latitude variables of each record so that it’s easier to see the distribution of H1B applicants on the map.

# Function to get data for particular states
# Remove records containing NA values in longitutde and latitude

get_geo_data <- function(source_df, state) {
    df <- source_df %>%
        filter(!is.na(lon) & !is.na(lat) & STATE == state)
    return(df)
}

# Function to download Google map
get_state_map <- function(state, zoom_number) {
    get_map(location = state, source = "google",
            maptype = "roadmap", zoom = zoom_number)
}

# Plot California map
Cal_df <- get_geo_data(certified_h1b, "CALIFORNIA")
Cal_Map <- get_state_map("california", 6)

Cal_plot <- ggmap(Cal_Map) +
    geom_point(aes(x=lon, y=lat), data=Cal_df, col="blue", alpha=0.02) +
    ggtitle("H1B applicants in California")

# Plot Texas map
Tx_df <- get_geo_data(certified_h1b, "TEXAS")
Tx_Map <- get_state_map("Texas", 6)
Tx_plot <- ggmap(Tx_Map) +
    geom_point(aes(x=lon, y=lat), data=Tx_df, col="blue", alpha=0.02) +
    ggtitle("H1B applicants in Texas")

# Plot New York map
Ny_df <- get_geo_data(certified_h1b, "NEW YORK")
Ny_Map <- get_state_map("New York", 7)
Ny_plot <- ggmap(Ny_Map) +
    geom_point(aes(x=lon, y=lat), data=Ny_df, col="blue", alpha=0.02) +
    ggtitle("H1B applicants in New York")

grid.arrange(Cal_plot, Tx_plot, Ny_plot, ncol=2)

The dots cluster around San Francisco Bay Area, Los Angelous and San Diego in California. Metropolitans in Texas that have the most H1B applicants are Dallas-Fort Worth area and Houston. For New York state, NYC including Long Island is far ahead than any other cities.

Changes in the proportion of computer and mathematical occupations

# Computer and Mathematical Occupations 
cmo <- c("COMPUTER AND INFORMATION RESEARCH SCIENTISTS",
         "COMPUTER SYSTEMS ANALYSTS", "INFORMATION SECURITY ANALYSTS",
         "COMPUTER PROGRAMMERS", "SOFTWARE DEVELOPERS, APPLICATIONS",
         "SOFTWARE DEVELOPERS, SYSTEMS SOFTWARE", "WEB DEVELOPERS", 
         "DATABASE ADMINISTRATORS",
         "NETWORK AND COMPUTER SYSTEMS ADMINISTRATORS", 
         "COMPUTER NETWORK ARCHITECTS", 
         "COMPUTER USER SUPPORT SPECIALISTS", "COMPUTER NETWORK SUPPORT SPECIALISTS", 
         "COMPUTER OCCUPATIONS, ALL OTHER", "ACTURIES", "MATHEMATICIANS", 
         "OPERATIONS RESEARCH ANALYSTS",
         "STATISTICIANS", "MATHEMATICAL TECHNICIANS", 
         "MATHEMATICAL SCIENCE OCCUPATIONS, ALL OTHER")


# Generate a dataset containing H1B records with computer and mathematical occupations
cmo_quantity_per_year <- certified_h1b %>%
    filter(SOC_NAME %in% cmo & !is.na(PREVAILING_WAGE)) %>%
    group_by(YEAR) %>%
    summarise(cm_occupations = n(),
           cm_median_wage = median(PREVAILING_WAGE))

# Join two datasets together
multi_df1 <- merge(x = cmo_quantity_per_year,
                   y = case_quantity_per_year,
                   by = "YEAR",
                   all = TRUE)

# Calculate the percentage of computer and mathematical occupations 
multi_df1 <- multi_df1 %>%
    mutate(other_occupations = all_occupations - cm_occupations,
           cm_percent = cm_occupations / all_occupations) %>%
    select(YEAR, cm_occupations, other_occupations)

multi_df1 <- gather(multi_df1, occupation, count, cm_occupations:other_occupations)
    
# Stacked bar plot showing the proportions of computer and 
# mathematical occupations over years
ggplot(multi_df1,
       aes(x = YEAR, y = count, fill = occupation)) +
    geom_bar(stat = "identity") +
    scale_y_continuous(breaks = seq(0, 600000, 100000), 
                       limits = c(0, 600000), labels = scales::comma) +
    labs(x = "Year", y = "No. of H1B Cases", 
         title = "Change in the proportion of computer and mathematical occupations")

Next, I plotted a stacked bar chart to investigate the trend of the proportion of H1B applicants who have computer or mathematical occupations from 2011 to 2016. The proportion of computer or mathematical occupations over the total occupations gradually increased over years. With the flourishes in Internet industry, the demand for talents with computer-related skills has been growing year by year.

Trend of Wages for Computer or Mathematical Occupations and Other Occupations

# Create a catagorical variable to seperate computer and mathematical occupations
# from other occupations
multi_df2 <- certified_h1b %>%
    filter(!is.na(PREVAILING_WAGE)) %>%
    mutate(occupation_catagory = if_else(SOC_NAME %in% cmo, 
                                         "cm_occupations", "other_occupations")) %>%
    select(YEAR, occupation_catagory, PREVAILING_WAGE)

# Boxplot showing the distribution of wages for computer and mathematical 
# occupations and wages for other occupations
cmo_boxplot <- ggplot(multi_df2, aes(x=occupation_catagory, y= PREVAILING_WAGE)) +
    geom_boxplot(aes(fill=YEAR)) + xlab("Occupation Catagory") + ylab("Wage (USD)") +
    ggtitle("Wages of Computer/Mathematical Occupations and Other Occupations") +
    coord_cartesian(ylim=c(25000,150000))

cmo_boxplot

There is an obvious increase in the wages for computer and mathematical occupations. The median wage jumped a lot from 2011 to 2012, after that we can see a steady increase.
While for other occupations, there is some fluctuations in the median wage and no apparent increase over years. Besides, variance in wages for other occupations across the six years is larger than those for computer and mathematical occupations.

Distribution of computer and mathematical occupations in the USA

# Summarize the H1B data by city, get the total number of computer and mathematical 
# H1B cases over the six years by cities, select cities with no less than 100 H1B cases
cmo_by_cities <- certified_h1b %>%
    filter(SOC_NAME %in% cmo & !is.na(CITY) & !is.na(lon) & !is.na(lat)) %>%
    group_by(STATE, CITY, lon, lat) %>%
    summarise(computer_mathematical_occupations = n()) %>%
    filter(computer_mathematical_occupations >= 2000) %>%
    arrange(desc(computer_mathematical_occupations))


cmo_map <- plot_petition_by_state + 
    geom_point(data = cmo_by_cities, aes(x = lon, y = lat, 
                                         size = computer_mathematical_occupations,
                                         alpha = 0.05, group = NULL)) +
    scale_size(range = c(1,10)) +
    guides(alpha = FALSE) +
    labs(title = 'Distribution of Computer and Mathematical 
         Occupations across US Major Cities')

cmo_map

# Top 20 cities with the most computer and mathematical occupations
knitr::kable(cmo_by_cities[1:20, 
                           c("STATE", "CITY", "computer_mathematical_occupations")])

STATE	CITY	computer_mathematical_occupations
NEW YORK	NEW YORK	63053
GEORGIA	ATLANTA	35727
TEXAS	HOUSTON	35241
CALIFORNIA	SAN FRANCISCO	34461
CALIFORNIA	SAN JOSE	29342
ILLINOIS	CHICAGO	25530
CALIFORNIA	SUNNYVALE	22837
NORTH CAROLINA	CHARLOTTE	22709
TEXAS	IRVING	22622
WASHINGTON	REDMOND	20391
NEW JERSEY	JERSEY CITY	18925
CALIFORNIA	MOUNTAIN VIEW	18393
TEXAS	DALLAS	16501
WASHINGTON	SEATTLE	15922
ARIZONA	PHOENIX	14879
GEORGIA	ALPHARETTA	14764
TEXAS	AUSTIN	14677
CALIFORNIA	SANTA CLARA	14636
TEXAS	PLANO	14138
WASHINGTON	BELLEVUE	13745

Not surprisingly, for states that filed the most H1B petitions, the numbers of computer and mathematical positions are correspondingly large. This is because computer and mathematical occupations constitute a big part in the total H1B workers’ positions.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Computer and mathematical occupations have accounted for a majority of total H1B cases since 2013. As there are more and more computer and mathematical occupations appearing from 2011 to 2016, the quantity of total H1B visa petitions also shows an upward trend.

It also makes sense that the wages of the majority will have an impact on the wages of all H1B applicants. H1B applicants who have computer or mathematical occupations are often provided with a higher salary, which might potentially drive up the wage level of all H1B applicants.

Were there any interesting or surprising interactions between features?

Computer and mathematical occupations gather in metropolitan areas. More developed areas are in need of a great number of talents with technical background. As the biggest center of technology in the United States, the Bay Area (San Francisco, San Jose, Sunnyvale, Mountain View, Santa Clara) has the most H1B workers in the computer and mathematical fields.

Final Plots and Summary

Plot One

wage_dist_per_year

Description One

From the above unimodal, right skewed histograms of wage, we can see that the peaks on the left side is becoming higher and higher, meaning most H1B workers earn wages between $55,000 and $70,000 in each year from 2011 to 2016.

The histograms also tell us that there are more H1B workers in 2015 and 2016 than in previous years, which is an indicator of US economy recovery.

Plot Two

cmo_boxplot

Description Two

Generally, wages for computer and mathematical occupations are higher than other occupations. The median wage level for H1B workers having computer or mathematical jobs is going up every year, with more and more people earning higher than $120,000 annually. During the recent years, computer techlogy is developing much faster than ever before and big data is booming as well, thus it’s not surprising to see the high demand for talents in the computer and mathematical areas.

Plot Three

cmo_map

Description Three

San Francisco Bay Area, New York City, Seattle, Atlanta, Washington DC, Chicago, Dallas-Fort Worth, Houston are metropolitans that have a great amount of opportunities of computer and mathematical positions. Correspondingly, the states where these metropolitans locate are popular states that filed many more H1B petitions than other states.

Reflection

When I started out by examing the structure of the H1B dataset, I dealt with missing values, outliers and filtered out irrelevant records, but I missed some inconsistent data points with mixed-case characters, which caused a big problem in my further analysis. For example, the values of SOC_NAME in 2011 contain mixed case letters but in 2016 are all uppercase letters, I got a completely wrong picture due to this mistake. The raw dataset we obtained may not be as clean as we thought, exploratory data analysis requires an open, inquisitive, and skeptical mind and strong attention to detail.

The only numerical variable directly available in this dataset is PREVAILING_WAGE, so I decided to put more effort in analyzing the wages amongst H1B workers, based on different groups. When I tried to draw a histogram of wages, I noticed that the huge amount of data combined with the existence of extremely large values slow down the process. So I randomly sampled about one tenth of all the data points and filter out the outliers. The final histogram I generated perfectly mimics the true distribution of wages for all H1B workers from 2011 to 2016.

The selection of the appropriate types of visualization should be based on the variables in the dataset. However, most of the time it is necessary to create new variables in order to make more informative data visualization. By splitting the WORKSITE variable into STATE and CITY, I’m able to draw a map displaying popular cities for H1B workers. By extracting computer and mathematical occupations from the dataset, I find that foreign workers with such kind of technical skills will have a higher chance of getting hired because of the increaing opportunities in the US job market.

One of the biggest limitations is that the dataset lacks of the academic background of H1B workers. Some achieved their university degrees in the US, while others hold their university degrees in their home countries. Some followed the career paths guided by what they learned from school, while others made a career transition after graduation. These are important information that could help me conduct drill-down analysis in terms of US employers’ preference for US universities or STEM majors.

H1B Visa Petition Analysis

Sheena Yu

April 21, 2017