Introduction: In recent years, everyone is talking about Data Science and Big data to analyze and utilize information to understand many trends.
Data Scientist is becoming a crucial role in the data driven organizations. These companies are hiring data science skills on high salaries. So, I got interested in the data scientist salary trend and how it will be for next year. I have chose to work on H1-B dataset for this purpose.
H1-B is the non-immigrant visa program. Every year United States employers applies for high skilled foreign nationals to come in the US and work for their company. H1-B database contains some very important information such as position wages, work location etc. In this analysis, I am only predicting prevailing wages and not wage_1. To understand different attributes in detail, please refer data dictionary at http://www.flcdatacenter.com/CaseH1B.aspx. I have downloaded this data from http://enigma.io/.
So, let’s get started.
First question is which state is applying for the maximum H1-B jobs
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(ggplot2))
suppressMessages(library(scales))
#set directory of h1b files as working directory
ls_files<-dir()
files<-grep(pattern ="^h1b 20",ls_files,value = TRUE)#value true will print file names ow index
h1b_all<-sapply(files,function(x)read.csv(x,stringsAsFactors=FALSE,header=TRUE))
df<-ldply(.data = h1b_all)
df<-df[grep('Certified|CERTIFIED',df$case_status),]
location<-df[,"state_1"]
plot<-sort(table(location),decreasing = TRUE)
Most_h1b_state<-head(data.frame(State=names(plot),Freq=plot),10)
ggplot(data=Most_h1b_state,aes(State,Freq,size=Freq))+geom_point(pch=21,col="salmon",bg="salmon")+scale_size_area()+ggtitle("H1-B Jobs Certified 2006 - 2015\n")+scale_y_continuous(name="Frequency",labels=comma)
California applied about 600,000 H1-B jobs in 10 year.
Now, sub setting only Data scientist positions from the database.
df$job_title<-iconv(enc2utf8(df$job_title),sub="byte")#Converts the strings in a character vector from one string encoding to another.
df$job_title<-tolower(df$job_title)#small alphabets
head(df$job_title)
## [1] "systems analyst/consultant" "programmer analyst"
## [3] "software engineer" "staff software engineer"
## [5] "programmer / analyst" "programmer analyst"
str<-c('data scientist|Data Scientist|Data scientist|data Scientist')#or sign will go through in either or option and won't result into warning message
data_job<-df[grep(str,df$job_title),]
#data_job<-subset(df,grep(pattern=str,df$job_title))#does not work
#data_job<-filter(df,grepl(c('data scientist','Data Scientist','Data scientist','data Scientist',job_title))#does not work
part_time<-filter(data_job,grepl('Y',part_time_1))
full_time<-filter(data_job,grepl('N',part_time_1))
full_time<-full_time[full_time$rate_per_1=="Year",]
(table(part_time$state_1))
##
## CA NY VA
## 2 1 1
#California, NY and Virginia placing more part time data science jobs
boxplot(part_time$wage_rate_1,names=c("Part Time"),
col=c("thistle"),
xlab="Paid Salary",
main="H1-B Data Scientist Part Time salary")
California has maximum full fime and part time data scientist jobs following New York and Massachusetts.
Part time jobs median job rate is $55 per hour.
which.max(full_time$prevailing_wage_1)#85
## [1] 85
full_time[85,]#prevailing_wage_1 $157435
## .id year case_no employer_name
## 85 h1b 2012.csv 2012 I-200-12194-912882 NETFLIX, INC.
## address address_2 city state zip_code
## 85 100 WINCHESTER CIRCLE <NA> LOS GATOS CA 95032
## job_title job_code case_status wage_rate_1 rate_per_1
## 85 senior data scientist 15-1111 CERTIFIED 157435 Year
## part_time_1 city_1 state_1 prevailing_wage_1 naics_code
## 85 N LOS GATOS CA 157435 532230
which.min(full_time$prevailing_wage_1)#250
## [1] 406
full_time[250,]#prevailing_wage_1 $85446
## .id year case_no
## 251 h1b 2013.csv 2013 I-200-13234-002138
## employer_name address
## 251 MAGNETIC MEDIA ONLINE, INCORPORATED 311 W. 43RD STREET, SUITE 1406
## address_2 city state zip_code job_title job_code
## 251 <NA> NEW YORK NY 10036 senior data scientist 15-1132
## case_status wage_rate_1 rate_per_1 part_time_1 city_1 state_1
## 251 CERTIFIED 130000 Year N NEW YORK NY
## prevailing_wage_1 naics_code
## 251 85446 541890
median(full_time$prevailing_wage_1)#91926
## [1] 91926
head(sort(table(full_time$city_1),decreasing=T),10)
##
## SAN FRANCISCO MOUNTAIN VIEW NEW YORK MENLO PARK PALO ALTO
## 133 95 79 49 47
## SAN JOSE CAMBRIDGE BELLEVUE REDMOND BOSTON
## 26 21 19 19 18
Median income for data scientist job is $91,926. San Francisco, Mountain View, New York, Menlo Park and Palo alto are the top five cities offering full time jobs.
Now looking at the year-to-year salary difference in data scientist.
ndf<-full_time[,c('year','employer_name','city','state','job_title','state_1','wage_rate_1','prevailing_wage_1','naics_code')]
nmatrix<-ddply(ndf,~year,summarize, median=median(prevailing_wage_1))
ggplot(data=nmatrix,aes(year,median))+geom_smooth(method="lm")+ggtitle("H1-B Data Scientist median full time salary 2009 - 2015\n")
Predict 2016 Data Scientist salary
lm<-lm(median~year,data=nmatrix)
summary(lm)
##
## Call:
## lm(formula = median ~ year, data = nmatrix)
##
## Residuals:
## 1 2 3 4 5 6 7
## 1111 -4488 1214 6896 -1443 -6882 3592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7148563.6 1959252.5 -3.649 0.0148 *
## year 3597.3 973.8 3.694 0.0141 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5153 on 5 degrees of freedom
## Multiple R-squared: 0.7319, Adjusted R-squared: 0.6782
## F-statistic: 13.65 on 1 and 5 DF, p-value: 0.01408
data.frame(nmatrix , fitted.value= fitted (lm), residual= resid (lm))
## year median fitted.value residual
## 1 2009 79602.0 78491.05 1110.946
## 2 2010 77600.0 82088.39 -4488.393
## 3 2011 86900.0 85685.73 1214.268
## 4 2012 96179.0 89283.07 6895.929
## 5 2013 91437.0 92880.41 -1443.411
## 6 2014 89596.0 96477.75 -6881.750
## 7 2015 103667.5 100075.09 3592.411
predict(lm,interval="confidence")
## fit lwr upr
## 1 78491.05 69465.68 87516.43
## 2 82088.39 75008.31 89168.48
## 3 85685.73 80088.43 91283.03
## 4 89283.07 84276.69 94289.45
## 5 92880.41 87283.11 98477.71
## 6 96477.75 89397.66 103557.84
## 7 100075.09 91049.71 109100.47
Let’s plug the value in the formula and calculate salary for 2016. Y=a+bX
In our example, Y is median data scientist salary - a dependent variable, a is intercept with Y, X is year - an independent variable and b is coefficient of X.
To predict 2016 data scientist salary,
-7148563.6+3597.3*2016
## [1] 103593.2
So, Data Scientist’s estimated salary for 2016 is $103,593.2 with 95% confidence interval.