H1B Visa Analysis (2014 -2016)

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(usmap)
library(sf)
library(ggmap)
library(plotly)
library(dplyr)
library(lubridate)
library(forcats)
library(rcartocolor)
library(gt)

Goal

The current administration adopts more friendly immigration policies. We want to help people to find visa-friendly companies and land their first job in the U.S by looking at the location, industry, number of visa applications, and visa certified rates so that job seekers can seize opportunities before the immigration policies change again.

Data

Source: https://www.kaggle.com/datasets/jboysen/us-perm-visas
The data contains information on H1B visa applications roughly between 2006-2016.
For this analysis, we cleaned and used the data between 2014 and 2016.
The clean data has 162,442 observations and 24 variables (see below).

# Import dataset
ds1<-read_csv("h1b_clean_new.csv")
ds<-ds1
# Change the class of variables
ds$case_status <- factor(ds$case_status)
ds$case_received_date <- as.Date(ds$case_received_date,"%m/%d/%Y")
ds$decision_date <- as.Date(ds$decision_date,"%m/%d/%Y")
ds$employer_state <-factor(ds$employer_state)
ds$job_info_work_state <- factor(ds$job_info_work_state)
ds$foreign_worker_info_state <- factor(ds$foreign_worker_info_state)
ds$country_of_citizenship <- factor(ds$country_of_citizenship)

# Remove unnecessary characters 
ds$foreign_worker_info_education<- gsub("'s", "", ds$foreign_worker_info_education)
ds$foreign_worker_info_education <- factor(ds$foreign_worker_info_education)
ds$job_info_education <-gsub("'s","",ds$job_info_education)
ds$job_info_education <-gsub("High School","HighSchool",ds$job_info_education)
ds$year <- year(ds$case_received_date)

# Filter data to year>=2014
ds<- ds %>% filter(year>2013)

##  [1] "case_no"                       "case_received_date"           
##  [3] "case_status"                   "class_of_admission"           
##  [5] "country_of_citizenship"        "decision_date"                
##  [7] "employer_name"                 "employer_num_employees"       
##  [9] "employer_state"                "employer_yr_estab"            
## [11] "foreign_worker_info_education" "foreign_worker_info_state"    
## [13] "job_info_education"            "job_info_job_title"           
## [15] "job_info_work_state"           "pw_amount_9089"               
## [17] "pw_job_title_908"              "pw_unit_of_pay_9089"          
## [19] "us_economic_sector"            "wage_offer_from_9089"         
## [21] "wage_offer_to_9089"            "wage_offer_unit_of_pay_9089"  
## [23] "wage_offered_unit_of_pay_9089" "year"

Trend

Looking at the visa applications data, we noticed :
- There were 162,442 working visa (H1B) applications sponsored by 29,884 companies in the U.S. between 2014 and 2016.
- Tightened immigration policies from 2017 to 2021 and the Covid-19 pandemic since 2020 would also influence the number of applications.
Between 2021 and 2022, H1B visa applications has increased by 56.8 % (https://www.americanimmigrationcouncil.org/research/h1b-visa-program-fact-sheet)
- Based on Bureau of Labor Statistics data, foreign-born workers in the United States in 2021 made up 17.4 percent of the total workforce.
- From 2020 to 2021, the foreign-born labor force increased by 671,000. (https://www.bls.gov/news.release/pdf/forbrn.pdf)
- Sponsoring H1B visas is the primary way to acquire foreign professionals.

ds %>% group_by(year) %>% summarise(number=n(),group=1) %>% 
  ggplot(aes(factor(year),number, group=interaction(group), label=number))+
  geom_line(color="blue", size=1,)+
  geom_point(size=3)+
  geom_text(hjust=1.3, size=5)+
  ggtitle("H1B applications declined in 2016", subtitle = "Followed by tightend immigration policies (2017-2021) and Covid-19 (2020-2022).")+
  labs(x="Year", y="Number of Visa Applications\n")+
  theme_minimal()+
  theme(panel.grid.minor.y = element_blank())

Assumption

Companies do not have significant change in their operation and Human Resources policies after 2016.
The immigration policies remain the same under the current administration.
The demand in the labor market does not decline severely due to fluctuating economy.

Summary

Over 50% of visa-friendly companies are located in 6 states
- West: California
- Northeast: New York, New Jersey
- Southwest: Texas
- Southeast: Florida
- Midwest: Illinois
New Jersey and Florida have relatively lower wage offers given their high cost of living.
A Master’s degree gives you more opportunities in the job market. You can apply for 80% of jobs (2014-2016).
Top 10 hirers are Technology, Consulting and IT companies in California, Texas, Pennsylvania and Washington.
Apply to companies with higher visa certified rate.
Other visa-friendly industries are Advanced Manufacturing, Finance, Educational Services, and Health Care.

Analysis

Do you have friends or colleagues from other countries who work in the U.S.? What are their job titles?

According to the data between 2014 and 2016, your friends could be software engineers, system analysts, or senior managers in California, Washington or Oregon.

### What are populare job titles in the top 3 states? What about Washington (14th) and Oregon (26th)?
#install.packages("wordcloud")
library(wordcloud)

## Loading required package: RColorBrewer

#install.packages("RColorBrewer")
library(RColorBrewer)
#install.packages("wordcloud2")
library(wordcloud2)
#install.packages("tm")
library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

#install.packages("patchwork")
library(patchwork)

# Create a function to generate a wordcloud based on user input

f.wdcloud <- function(Stateinput,Yearinput){
  WCtitle <- paste("Popular Job Titles"," (",Stateinput,",",Yearinput, ")" )

# Use user input to filter the data
ds1<- ds%>% filter(job_info_work_state==Stateinput, year==Yearinput)
#ds1

#Create a vector containing only the text
text <- unique(ds1$job_info_job_title)

# Create a corpus  
docs <-Corpus(VectorSource(text))

#Clean the text data
docs <-docs %>% 
  tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(stripWhitespace)
docs <- docs %>% tm_map(content_transformer(tolower))
docs <- docs %>% tm_map(removeWords, stopwords("english"))

# Create a document-term-matrix
dtm <- TermDocumentMatrix(docs)
matrix <-as.matrix(dtm)
words <- sort(rowSums(matrix),decreasing = T) 
df<- data.frame(word= names(words),freq=words)

# Generate Word cloud

layout(matrix(c(1, 2), nrow=2), heights=c(1, 9))
par(mar=rep(0, 4), bg="white")
plot.new()+
text(x=0.5, y=0.5, WCtitle, col="black")
wordcloud(words = df$word, freq = df$freq, min.freq =10,max.words=200,
          random.order=F, rot.per=0.35,colors=brewer.pal(6, "Paired"))
}

# f.wdcloud("StateInput", "YearInput(2014-2016)")

# Top 1
f.wdcloud("CA","2016")

# Top 14
f.wdcloud("WA","2016")

# Top 26
f.wdcloud("OR","2016")

Where are companies that provide those jobs?

Visa-friendly companies are disproportionately located on the west and east coast.
The more companies in a state , the more job opportunities (before the pandemic and prevalence of working-from-home)
A median of 38% of jobs required a Master’s degree, and 42% of jobs required a Bachelor’s degree.

# Group the data by Employer_STATE and Count the number of unique EMPLOYER_NAME.

EmployerInEachState <- ds%>% group_by(employer_state) %>% 
  summarize(companies=length(unique(employer_name)))

EmployerInEachState<-as.data.frame(EmployerInEachState)
colnames(EmployerInEachState)<-c("state", "Companies")

# Create a sub dataset for job demands in each state by education level
jobsuply <- ds %>% group_by(job_info_work_state,job_info_education) %>% summarise(n=n()) 
jobsuply <- as.data.frame(jobsuply)

# Pivot the table wider and remove the column we don't need
jobsuply <- jobsuply %>% spread(job_info_education, n, fill=0)
jobsuply<-jobsuply[,1:8]


# Left_join EmployInEachState and jobsuply
EmployerInEachState <-left_join(EmployerInEachState, jobsuply, by= c("state"="job_info_work_state"))


jds<- EmployerInEachState %>% 
  mutate(jtotal=Associate+Bachelor+Doctorate+HighSchool+Master+None+Other,
         Ap=round(Associate/jtotal*100),
         Bp=round(Bachelor/jtotal*100),
         Dp=round(Doctorate/jtotal*100),
         HSp=round(HighSchool/jtotal*100),
         Mp=round(Master/jtotal*100), 
         Np=round(None/jtotal*100), 
         Otp=round(Other/jtotal*100),
         Cutcompanies=cut(Companies, breaks=c(-Inf,500,1000,2000,3000,4000,5000,Inf),
                          labels=c("<500","501-1000","1001-2000","2001-3000","3001-4000",
                                   "4001-5000",">5000")))

g <- list(scope = 'usa',projection = list(type = 'albers usa'),showlakes = F)

jds$hover <- with(jds, paste("Job Offers:",jtotal,'<br>', '<br>', "Education Requirement",'<br>', "Doctorate: ", Dp ,"%", '<br>',"Master's: ", Mp,"%", '<br>',"Bachelor's: ",Bp,"%"))

map <- plot_geo(jds,locationmode='USA-states')
map <- map %>% add_trace(z = ~Companies, text = ~hover, locations = ~state, color = ~Companies)

map <-map %>% layout(title="\nMore Visa-Friendly Companes in CA, NY & TX\nMaster's/Bachelor's Degrees Allow More Opportunitie\n(2014-2016)",geo=g)
map <- map %>% colorbar(title = "Number of\nCompanies")



htmlwidgets::saveWidget(
                widget = map, #the plotly object
                file = "6. USmap.html", #the path & file name
                selfcontained = TRUE #creates a single html file
                )

map

Top 26 States where visa-friendly companiess are located.

57.3% of visa friendly employers are in 6 states:
- California ——(19.6%)
- New York ——(11.8%)
- Texas ——(9.3%)
- New Jersey ——(6.4%)
- Florida ——(5.4%)
- Illinois ——(4.8%)
Washington (1.8%) ranks No.14, and Oregon (0.6%) ranks No.26.

ds6<-ds%>% group_by(employer_state) %>% summarise(n=length(unique(employer_name))) %>% arrange(desc(n))
ds6<-ds6 %>% mutate(cperct=round(n/sum(n),3)*100, cgroup=case_when(cperct>12~">12%",cperct>7~"7%~12%",cperct>4~"4%~7%", TRUE~"<= 4%"))
ds6<-ds6[1:26,]
ds6$cgroup<- factor(ds6$cgroup,levels = c(">12%","7%~12%","4%~7%","<= 4%"))
ds6$employer_state<- fct_reorder(ds6$employer_state,ds6$cperct)
ds6$rank<-c(1:26)
names(ds6)<-c("State","Companies", "Percentage","Group", "Rank")

ds6$State<- fct_rev(fct_reorder(ds6$State,ds6$Rank))



p3 <-ggplot(ds6)+
  geom_col(aes(x=Percentage, y=State,text=paste("Top",Rank,"(", Companies, "companies)"), fill=Group))+
  labs(x = "The Percentage (%) of Visa-Friendly Enterprise", y ="" , title = "57.3% of Visa-Friendly Companies are Located in 6 States. Oregon Ranks No.26\nCalifornia is No.1 and Washington No.14 (2014 - 2016)", fill="Percentage")+
  theme_minimal()+
  theme(legend.position =c(0.9,0.3), panel.grid.major.y = element_blank(), axis.text.y = element_text(color =c("magenta","black","black","black","black","black","black","black","black","black","black","black","magenta","black","black","black","black","black","black","black","black","black","black","black","black","magenta")))+
  scale_fill_viridis_d(option = "D",direction = -1)

#p3


gp3<- ggplotly(p3)

htmlwidgets::saveWidget(
                widget = gp3, #the plotly object
                file = "7. Top26 States.html", #the path & file name
                selfcontained = TRUE #creates a single html file
                )

gp3

Wage offers in the top 6 states

The average wage increases graduately while the difference between wage offers in each states remain unchanged.
Difference between wage offers:
- New York has the widest wage range around $36,000.
- $28,000~$29,000 in California, Texas, and Illinois.
- $20,000~$21,000 in New Jersey and Florida
Professionals with Master’s degrees receive lower wage offers in New Jersey and Florida despite higher cost of living.

ds7<-ds %>% filter(wage_offer_unit_of_pay_9089=="Year",
              foreign_worker_info_education=="Master",
              job_info_work_state %in% c("CA", "NY", "TX", "NJ","FL", "IL"), wage_offer_from_9089<200000) %>% 
  group_by(job_info_work_state, year) %>%
  summarise(avg.wage.lower.limit=round(mean(wage_offer_from_9089, na.rm=T)/1000),
            avg.wage.upper.limit=round(mean(wage_offer_to_9089,na.rm=T)/1000))
ds7$State<-factor(ds7$job_info_work_state,levels = c("CA","NY","TX","NJ","IL","FL"))              
ds7$year<-factor(ds7$year)
#ds7
ds7<-ds7 %>% gather(key=Limit, value = avgWage, c("avg.wage.lower.limit", "avg.wage.upper.limit"))
#ds7

p8<- ds7 %>% 
  ggplot(aes(year,avgWage, group=interaction(Limit),color=State,shape=State,linetype=Limit,label=avgWage))+
  geom_point(size=4)+
  geom_line(size=1)+
  geom_text(vjust=1.5,hjust=-0.15,size=2.5)+
  labs(x="", y="Mean Wage Offer ( $ thousand dollars / year)", title="Where Can You Get a Better Wage Offer (2014-2016)?", subtitle = "Professionals with Master's degrees receive relatively lower wage offers in New Jersey and Florida.")+ 
  scale_color_manual(values=c("#fde725","#20a486","#20a486","#3b528b","#3b528b","#3b528b"))+
  theme_bw()+
  theme(legend.position = "top")+
  facet_grid(~State)

p8

Who are top 10 hirers and their location?

COGNIZANT TECHNOLOGY is the No.1 hirer. It’s an American multinational IT service company founded in 1944 in India.
5 out of top 10 companies are in CA.
Hi-Tech, IT and Consulting companies are more visa-friendly.

nationtop10 <- ds %>% group_by(employer_name, employer_state) %>% summarise(n=n()) %>% arrange(desc(n))

nationtop10 <- nationtop10[1:10,]
nationtop10$employer_name<- fct_reorder(nationtop10$employer_name,nationtop10$n)

nationtop10$employer_state<-factor(nationtop10$employer_state,levels=c("CA","TX", "WA","PA"))
label1<-nationtop10[1,]

#nationtop10
ntop10<-nationtop10 %>% group_by(State=employer_state) %>% summarise('Visa Applications'=sum(n))

#nationtop10 %>% 
  #ggplot(aes(x=employer_state,y=n, fill=employer_name, label=n))+
  #geom_bar(stat = "identity", color="white")+
  #geom_text(position = position_stack(vjust = 0.5),color="white",size=5)+
  #geom_text(data=label1,position=position_stack(vjust=0.5),color="black",size=5)+
  #ggtitle("Cognizant Technology Solutions was No. 1 Visa Sponsor", subtitle = "5 of top 10 companies were in California")+
  #labs(x="", y="Number of Visa Applications", fill="Company")+
  #scale_fill_carto_d(palette = 4, direction = -1,labels = function(x) str_wrap(x, width = 20))+
  #theme(panel.background = element_blank(),axis.ticks.x = element_blank())

nationtop10 %>% 
  ggplot(aes(x= employer_name, y=n))+
  geom_bar(stat="identity", aes(fill =employer_state))+
  geom_text(aes(label=n), color="black", hjust=1, size=4)+
  ggtitle("Cognizant Technology Solutions is No. 1 Visa Sponsor", subtitle = "5 of the top 10 companies are in California (2014-2016)")+
  ylab("")+
  xlab("")+
  labs(fill="State")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
  scale_fill_viridis_d(option = "D", alpha = 0.7, direction = -1, limits=c("CA","TX","PA","WA"))+
  theme(legend.position =c(0.9,0.5),panel.background = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())+
  coord_flip()

table2<- gt(ntop10) %>% 
  tab_header(title="Around 23,000 H1B visas are sponsored by top 10 hirers in 4 states ", subtitle = "Based on the data between 2014 and 2016")


table2

State	Visa Applications
Around 23,000 H1B visas are sponsored by top 10 hirers in 4 states
Based on the data between 2014 and 2016
CA	9145
TX	8565
WA	4271
PA	1034

Among the top 10 companies, who has higher H1B visa certified rates?

Cognizant Technology has the highest visa certified rate of 64.8%. A higher certified rate means the company secures more visas for its employees.
Infosys,Microsoft,and Cisco have higher certified-expired rates, which means H1B visa status changes for various reasons, such as employee turnover or getting a green card. We need more data to understand the cause of higher visa expired rates.
Low withdrawn and denied rates mean the companies are more supportive to their employees during visa application process.

casepercentage<- ds %>% group_by(employer_name, case_status) %>% summarise(n=n()) %>% arrange(desc(n), employer_name)
casepercentage<- casepercentage %>% spread(case_status, n, fill = 0) %>% arrange(desc(Certified), desc('Certified-Expire'))
casepercentage<- casepercentage %>% rename("CertifiedExpired" = "Certified-Expired") # rename a column just make it easier
casepercentage<- casepercentage %>% mutate(total=Certified+CertifiedExpired+Denied+Withdrawn, Cp=round(Certified/total*100,2), CEp=round(CertifiedExpired/total*100,2), Dp=round(Denied/total*100,2), Wp=round(Withdrawn/total*100,2))

casepercentage<- casepercentage %>% arrange(desc(total), desc(Cp))
#casepercentage

# Create a subset with top 10 companies and  pivot the table longer for plotting 
casep1<- casepercentage[1:10,]
casep1<- gather(casep1, key = cs, value = csp, Cp:Wp)
casep1$cs <-factor(casep1$cs, levels = c("Cp","CEp","Wp","Dp"))
casep1$csplable <-as.character(casep1$csp)


casep2<-casep1[1:10,1:5]
casep2<-gather(casep2,key = cs,value = n,2:5)
casep2$cs<-factor(casep2$cs, levels=c("Denied","Withdrawn","CertifiedExpired","Certified"))
casep2$employer_name<-
  factor(casep2$employer_name,levels = c("CISCO SYSTEMS, INC.",
                                         "MICROSOFT CORPORATION",
                                         "INFOSYS LTD.",
                                         "AMAZON CORPORATE LLC",
                                         "DELOITTE CONSULTING LLP",
                                         "APPLE INC.",
                                         "GOOGLE INC.",
                                         "ORACLE AMERICA, INC.",
                                         "INTEL CORPORATION",
                                         "COGNIZANT TECHNOLOGY SOLUTIONS US CORPORATION"))

casep2<- casep2 %>% group_by(employer_name) %>% mutate(Percentage=round(n/sum(n)*100,1),perctlab=ifelse(Percentage>5,paste(Percentage,"% "),""))
#casep21<-casep2[1:40,] %>% mutate()
#casep2
names(casep2)<- c("Name","Status","Number","Percentage","perctlab")
p4<-casep2 %>% 
  ggplot(aes(y=Name,x=Percentage, fill=Status))+
  geom_bar(stat = "identity",color="white", position = "fill")+
  geom_text(aes(label=perctlab),position="fill",color="white", hjust=1, size=4.5)+
  ggtitle("Infosys,Microsoft,and Cisco Have Higher H1B Visa Expired Rates",)+
  labs(x="",y="",fill="Status", subtitle="H1B visa status in these companies changes for various reasons\n(2014 - 2016)")+
  scale_y_discrete(labels = function(y) str_wrap(y, width = 20))+
  scale_fill_manual(values = c("#0072B2","#CC79A7","#E69F00","grey50"),limits = c("Certified", "CertifiedExpired","Withdrawn","Denied"))+
  #scale_fill_viridis_d(option = "D",limits = c("Certified", "CertifiedExpired","Withdrawn","Denied"))+
  theme(panel.background = element_blank(),legend.position = "top", axis.text.x = element_blank(),axis.ticks = element_blank(),axis.text.y = element_text(color=c( "#AA4499","#AA4499","#AA4499","black","black","black","black","black","black","black")))

p4

Interested in other industries?

Advanced Manufacturing, Finance, Educational Services, and Health Care industries could be your options.

ds8<- ds1 %>% filter(year>2013, !is.na(us_economic_sector)) %>% group_by(us_economic_sector) %>% summarise(number=n()) %>% mutate(perct=round(number/sum(number)*100,2)) %>% arrange(desc(perct))
names(ds8)<-c("Economic Sector", "Visa Applications", "Percentage" )

table1<- gt(ds8) %>% 
  tab_header(title="IT industry is the most visa friendly.", subtitle = "Based on the available data between 2014 and 2016")

table1

Economic Sector	Visa Applications	Percentage
IT industry is the most visa friendly.
Based on the available data between 2014 and 2016
IT	19806	42.26
Advanced Mfg	6935	14.80
Other Economic Sector	5098	10.88
Finance	3843	8.20
Educational Services	3623	7.73
Health Care	2378	5.07
Retail	1714	3.66
Aerospace	1606	3.43
Energy	404	0.86
Automotive	329	0.70
Transportation	309	0.66
Construction	297	0.63
Hospitality	189	0.40
Biotechnology	181	0.39
Agribusiness	63	0.13
Geospatial	58	0.12
Homeland Security	39	0.08