We start by loading the salary dataset given in salary_data.csv file and preparing the workspace.
Let’s take a peek at the data.
saldata[1:10, "EMPLOYER_NAME"]
## [1] "ADVANCED TECHNOLOGY GROUP USA, INC."
## [2] "SAN FRANCISCO STATE UNIVERSITY"
## [3] "CAROUSEL SCHOOL"
## [4] "HARLINGEN CONSOLIDATED INDEPENDENT SCHOOL DISTRICT"
## [5] "SIGNAL SCIENCES CORPORATION"
## [6] "CAPGEMINI U.S. LLC"
## [7] "PURE STORAGE, INC."
## [8] "POLMAK, INC."
## [9] "GOOGLE INC."
## [10] "STLPORT CONSULTING, INC."
Next we look at the type of visas and the salary ranges in our dataset.
table(saldata$VISA_CLASS)
##
## E-3 Australian greencard H-1B H-1B1 Chile
## 1393 11093 154497 147
## H-1B1 Singapore
## 148
summary(saldata$PREVAILING_WAGE_SUBMITTED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 54500 69700 71200 89900 6220000
Number of records by job-types.
table(saldata$JOB_TITLE_SUBGROUP)
##
## assistant professor attorney business analyst
## 18866 1488 27811
## data analyst data scientist management consultant
## 3840 1227 770
## software engineer teacher
## 99364 13912
Next, we will remove columns with more than 70% missing values and create some derived variables. (Code not shown)
We will also create a new Date variable in a standard format.
saldata$case_date = as.Date(saldata$CASE_RECEIVED_DATE, "%m/%d/%Y")
saldata$caseyr = as.numeric(format(saldata$case_date, "%Y"))
2. Visualize number of visas by year and job title.
setctstate = sqldf("select count(*) as 'count',
EMPLOYER_NAME as 'employer',
WORK_STATE2 as 'state',
avg(PAID_WAGE_PER_YEAR) as 'salary'
from activesal
group by EMPLOYER_NAME, WORK_STATE2")
#top_setctstate = setctstate[with(setctstate, order(state, -count, -salary)),]
top_setctstate = setctstate[with(setctstate, order( -count)),]
head(top_setctstate)
## count employer state salary
## 11321 4195 GOOGLE INC. california 126006
## 17765 1112 MICROSOFT CORPORATION washington 121732
## 22106 882 QUALCOMM TECHNOLOGIES, INC. california 94285
## 9691 806 FACEBOOK, INC. california 129203
## 13934 732 INTEL CORPORATION california 101627
## 5226 700 CERNER CORPORATION missouri 69044
2)Let us see which states offer the highest salary.
## state salary
## 1 dc 2400000
## 2 new york 2400000
## 3 alabama 1400000
## 4 ohio 600000
## 5 california 582400
## 6 texas 542400
## 7 massachusetts 528000
## 8 kansas 475000
## 9 new york 428333
## 10 nevada 385000
3) Analyze how software engineers fared! A topic (obviously) close to my heart! We create a dataframe called setstatesal for this analysis.
setstatesal = sqldf("select caseyr as 'year',
count(*) as 'count',
JOB_TITLE_SUBGROUP as 'title',
WORK_STATE2 as 'state',
max(PAID_WAGE_PER_YEAR) as 'maxsal',
avg(PAID_WAGE_PER_YEAR) as 'avgsal',
min(PAID_WAGE_PER_YEAR) as 'minsal'
from activesal
group by caseyr, JOB_TITLE_SUBGROUP, WORK_STATE2 ")
2) Salary by year and avg_salary. From graph below, we conclude that Attorneys make the highest salaries, which is not very surprising! Also, 2012 was the best year (salary-wise) for foreign software engineers.
3) Interactive Map using Leaflet. We create an interactive map to show Top employers in every state. Click on the blue marker to see info. The numbered circles are clickable links that zoom into the map for a better view.
That concludes our analysis for now. Did you like this tutorial? If so, please don’t forget to share!