Punam Katariya
Oct 12, 2015

Introduction: In recent years, everyone is talking about Data Science and Big data to analyze and utilize information to understand many trends.

Data Scientist is becoming a crucial role in the data driven organizations. These companies are hiring data science skills on high salaries. So, I got interested in the data scientist salary trend and how it will be for next year. I have chose to work on H1-B dataset for this purpose.

H1-B is the non-immigrant visa program. Every year United States employers applies for high skilled foreign nationals to come in the US and work for their company. H1-B database contains some very important information such as position wages, work location etc. In this analysis, I am only predicting prevailing wages and not wage_1. To understand different attributes in detail, please refer data dictionary at http://www.flcdatacenter.com/CaseH1B.aspx. I have downloaded this data from http://enigma.io/.

So, let’s get started.

First question is which state is applying for the maximum H1-B jobs

suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(ggplot2))
suppressMessages(library(scales))
#set directory of h1b files as working directory 
ls_files<-dir()
files<-grep(pattern ="^h1b 20",ls_files,value = TRUE)#value true will print file names ow index
h1b_all<-sapply(files,function(x)read.csv(x,stringsAsFactors=FALSE,header=TRUE))
df<-ldply(.data = h1b_all)
df<-df[grep('Certified|CERTIFIED',df$case_status),]
location<-df[,"state_1"]
plot<-sort(table(location),decreasing = TRUE)
Most_h1b_state<-head(data.frame(State=names(plot),Freq=plot),10)
ggplot(data=Most_h1b_state,aes(State,Freq,size=Freq))+geom_point(pch=21,col="salmon",bg="salmon")+scale_size_area()+ggtitle("H1-B Jobs Certified 2006 - 2015\n")+scale_y_continuous(name="Frequency",labels=comma)

df$job_title<-iconv(enc2utf8(df$job_title),sub="byte")#Converts the strings in a character vector from one string encoding to another.
df$job_title<-tolower(df$job_title)#small alphabets
head(df$job_title)
## [1] "systems analyst/consultant" "programmer analyst"        
## [3] "software engineer"          "staff software engineer"   
## [5] "programmer / analyst"       "programmer analyst"
str<-c('data scientist|Data Scientist|Data scientist|data Scientist')#or sign will go through in either or option and won't result into warning message
data_job<-df[grep(str,df$job_title),]
#data_job<-subset(df,grep(pattern=str,df$job_title))#does not work
#data_job<-filter(df,grepl(c('data scientist','Data Scientist','Data scientist','data Scientist',job_title))#does not work
part_time<-filter(data_job,grepl('Y',part_time_1))
full_time<-filter(data_job,grepl('N',part_time_1))
full_time<-full_time[full_time$rate_per_1=="Year",]

(table(part_time$state_1))
## 
## CA NY VA 
##  2  1  1
#California, NY and Virginia placing more part time data science jobs

boxplot(part_time$wage_rate_1,names=c("Part Time"),
        col=c("thistle"),
        xlab="Paid Salary",
        main="H1-B Data Scientist Part Time salary")

which.max(full_time$prevailing_wage_1)#85
## [1] 85
full_time[85,]#prevailing_wage_1 $157435
##             .id year            case_no employer_name
## 85 h1b 2012.csv 2012 I-200-12194-912882 NETFLIX, INC.
##                  address address_2      city state zip_code
## 85 100 WINCHESTER CIRCLE      <NA> LOS GATOS    CA    95032
##                job_title job_code case_status wage_rate_1 rate_per_1
## 85 senior data scientist  15-1111   CERTIFIED      157435       Year
##    part_time_1    city_1 state_1 prevailing_wage_1 naics_code
## 85           N LOS GATOS      CA            157435     532230
which.min(full_time$prevailing_wage_1)#250
## [1] 406
full_time[250,]#prevailing_wage_1 $85446
##              .id year            case_no
## 251 h1b 2013.csv 2013 I-200-13234-002138
##                           employer_name                        address
## 251 MAGNETIC MEDIA ONLINE, INCORPORATED 311 W. 43RD STREET, SUITE 1406
##     address_2     city state zip_code             job_title job_code
## 251      <NA> NEW YORK    NY    10036 senior data scientist  15-1132
##     case_status wage_rate_1 rate_per_1 part_time_1   city_1 state_1
## 251   CERTIFIED      130000       Year           N NEW YORK      NY
##     prevailing_wage_1 naics_code
## 251             85446     541890
median(full_time$prevailing_wage_1)#91926
## [1] 91926
head(sort(table(full_time$city_1),decreasing=T),10)
## 
## SAN FRANCISCO MOUNTAIN VIEW      NEW YORK    MENLO PARK     PALO ALTO 
##           133            95            79            49            47 
##      SAN JOSE     CAMBRIDGE      BELLEVUE       REDMOND        BOSTON 
##            26            21            19            19            18
ndf<-full_time[,c('year','employer_name','city','state','job_title','state_1','wage_rate_1','prevailing_wage_1','naics_code')]

nmatrix<-ddply(ndf,~year,summarize, median=median(prevailing_wage_1))
ggplot(data=nmatrix,aes(year,median))+geom_smooth(method="lm")+ggtitle("H1-B Data Scientist median full time salary 2009 - 2015\n")


Predict 2016 Data Scientist salary

lm<-lm(median~year,data=nmatrix)
summary(lm)
## 
## Call:
## lm(formula = median ~ year, data = nmatrix)
## 
## Residuals:
##     1     2     3     4     5     6     7 
##  1111 -4488  1214  6896 -1443 -6882  3592 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -7148563.6  1959252.5  -3.649   0.0148 *
## year            3597.3      973.8   3.694   0.0141 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5153 on 5 degrees of freedom
## Multiple R-squared:  0.7319, Adjusted R-squared:  0.6782 
## F-statistic: 13.65 on 1 and 5 DF,  p-value: 0.01408
data.frame(nmatrix , fitted.value= fitted (lm), residual= resid (lm))
##   year   median fitted.value  residual
## 1 2009  79602.0     78491.05  1110.946
## 2 2010  77600.0     82088.39 -4488.393
## 3 2011  86900.0     85685.73  1214.268
## 4 2012  96179.0     89283.07  6895.929
## 5 2013  91437.0     92880.41 -1443.411
## 6 2014  89596.0     96477.75 -6881.750
## 7 2015 103667.5    100075.09  3592.411
predict(lm,interval="confidence")
##         fit      lwr       upr
## 1  78491.05 69465.68  87516.43
## 2  82088.39 75008.31  89168.48
## 3  85685.73 80088.43  91283.03
## 4  89283.07 84276.69  94289.45
## 5  92880.41 87283.11  98477.71
## 6  96477.75 89397.66 103557.84
## 7 100075.09 91049.71 109100.47

Interpretation:
  1. As year increases by one data scientist median salary increases by $3,597.30.
  2. p-value - Summary shows a borderline significance(0.05) in this model.
  3. Adjusted R-squared - The model predicts 67.82% variation correctly, which is pretty good.

Let’s plug the value in the formula and calculate salary for 2016. Y=a+bX

In our example, Y is median data scientist salary - a dependent variable, a is intercept with Y, X is year - an independent variable and b is coefficient of X.

To predict 2016 data scientist salary,

-7148563.6+3597.3*2016
## [1] 103593.2

So, Data Scientist’s estimated salary for 2016 is $103,593.2 with 95% confidence interval.