When a US company wants to hire someone from outside the United States for a technical position, it must file an application with the US government to obtain a visa or green card for the foreign applicant. To show that US and non-US employees are treated equitably, the company must state in the application how much it is willing to pay the employee. At the same time, it must report the “prevailing wage,” the average amount that an employee with similar skills and background is typically paid for the same position.
The difference between the paid wage and the prevailing wage indicates whether US companies are willing to pay non-US employees more than the typical rate. Higher pay makes a position more attractive to prospective foreign employees.
Salaries can differ by area and by job, so understanding the relationship between salary, area, and position could help non-US employees choose employment in the US.
The raw data contains 167,278 records from Labor Condition Applications and permanent resident applications filed from 2008 to 2015. The visa class includes five types: “green card”, “H-1B”, “H-1B1 Chile”, “H-1B1 Singapore”, and “E-3 Australia”. For this project, I select the “H-1B” visa class for analysis, which leaves 154,497 records. The original data was compiled by the US Department of Labor’s Office of Foreign Labor Certification.
First, we need to load our data.
Let’s check the basic information for this data.
## Observations: 167,212
## Variables: 13
## $ CASE_STATUS <chr> "denied", "denied", "certified", "ce...
## $ CASE_RECEIVED_DATE <chr> "3/19/2015", "9/16/2014", "9/3/2014"...
## $ DECISION_DATE <chr> "3/19/2015", "9/23/2014", "9/9/2014"...
## $ EMPLOYER_NAME <chr> "SAN FRANCISCO STATE UNIVERSITY", "S...
## $ JOB_TITLE <chr> "Assistant Professor of Marketing", ...
## $ WORK_CITY <chr> "SAN FRANCISCO", "PORTLAND", "CHARLO...
## $ PREVAILING_WAGE_SOC_TITLE <chr> "Business Teachers, Postsecondary", ...
## $ WORK_STATE <chr> "California", "Oregon", "North Carol...
## $ FULL_TIME_POSITION_Y_N <chr> "NA", "y", "y", "y", "y", "n", "y", ...
## $ VISA_CLASS <chr> "greencard", "E-3 Australian", "H-1B...
## $ PREVAILING_WAGE_PER_YEAR <dbl> NA, NA, 27934.0, 36720.0, 40450.0, 4...
## $ PAID_WAGE_PER_YEAR <dbl> 91440.00, 170000.00, 77952.00, 36720...
## $ JOB_TITLE_SUBGROUP <chr> "assistant professor", "software eng...
We know we have 13 variables and 167,212 entries.
A bar plot shows the states ordered by count. Figure 1. The visa counts in different states (bar plot).
Now, we map these counts onto a map of the U.S. Figure 2. The visa counts in different states (map).
From these two plots, we can see that the number of jobs differs considerably from state to state. So, we group the states into six areas: West, Midwest, Southwest, Southeast, Northeast, and Other. The Other area contains Hawaii, Puerto Rico, Virgin Islands, Guam, Northern Mariana Islands, and Palau.
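The grouping step can be sketched as a simple lookup from state to area. This is a minimal illustration only: the source does not list which state belongs to which area, so the partial `STATE_TO_AREA` table, the `area_of` helper, and the sample `work_states` values are all assumptions (only the placement of Hawaii and Guam in the Other area comes from the text).

```python
from collections import Counter

# Hypothetical, partial lookup; the exact state assignments are not given in the source.
STATE_TO_AREA = {
    "California": "WEST",
    "Oregon": "WEST",
    "Illinois": "MIDWEST",
    "Texas": "SOUTHWEST",
    "North Carolina": "SOUTHEAST",
    "New York": "NORTHEAST",
    "Hawaii": "OTHER",  # stated in the text
    "Guam": "OTHER",    # stated in the text
}


def area_of(state):
    """Return the area for a state name, or 'UNMAPPED' if the lookup has no entry."""
    return STATE_TO_AREA.get(state, "UNMAPPED")


# Made-up WORK_STATE values standing in for the real column.
work_states = ["California", "Oregon", "North Carolina", "Texas", "Guam", "California"]
area_counts = Counter(area_of(s) for s in work_states)
print(area_counts)
```

Returning an explicit `UNMAPPED` label, rather than failing, makes it easy to spot states that the lookup has not covered yet.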
Table 1
Basic count information for each area.
Now, we can check the same statistics for the different job titles.
Let’s check the bar plot of job counts for each Job Title Subgroup. Figure 3. The visa counts for different job subgroups.
Table 2
The counts and percentages for each Job Title Subgroup.
In this project, the data size is 154,497. According to AREA, I can split the data into six parts: West, Midwest, Southwest, Southeast, Northeast, and Other. The Other area contains Hawaii, Puerto Rico, Virgin Islands, Guam, Northern Mariana Islands, and Palau. The sample size for the Other area is 482, which is 0.312% of the data. Because the count and percentage are small, and because the locations in the Other area are not common destinations for non-US employees, I do not consider the Other area in this project.
For this project, I will focus on two variables.
Before I start the data analysis, I want to point out one issue with this data set. The sample size is large, which raises the problem of statistical versus practical significance. The difference between a sample statistic and a hypothesized value is statistically significant if a hypothesis test indicates that it is too unlikely to have occurred by chance. With a very large sample, even practically negligible differences produce small p-values, so nearly every test would lead us to reject the null hypothesis. So, the first step in this project is to decrease the sample size. I use power analysis to calculate the sample size needed for this paper.
For power analysis, we need to know the standard deviation of the sample. Before this step, we need to decide which variable to use in our ANOVA model.
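To make the large-sample issue concrete, the following sketch holds a small difference fixed and lets only the sample size grow. It uses a simple z-test rather than the ANOVA used later, and the difference (0.01) and standard deviation (0.5) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n):
    """Two-sided z-test p-value for a sample mean that differs from 0 by mean_diff."""
    z = mean_diff / (sd / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny difference, increasingly large samples (illustrative values only).
for n in (100, 10_000, 150_000):
    print(n, two_sided_p(0.01, 0.5, n))
```

The p-value shrinks as n grows even though the difference itself never changes, which is why a significant result on the full data set would say little about practical importance.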
Figure 4. The distribution of Paid Wage Per Year. From this plot, we can see that the data is right-skewed. The distribution is clearly not normal, and its standard deviation is large: 49963.6805.
If we used this standard deviation for power analysis, we would need a very large sample. That does not suit us, because we want to avoid the problem of “practical significance versus statistical significance” that arises when the sample size is large. The solution is to use the log of the data, which gives a distribution with a smaller standard deviation, as shown in the following plot.
Figure 5. The distribution of the log of Paid Wage Per Year. The shape is approximately bell-shaped.
For a better result, we check the standard deviation of each group. Based on the following table, we choose a standard deviation of 0.5 for the power analysis.
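The effect of the log transform can be illustrated on synthetic data. The sketch below draws right-skewed "wages" from a log-normal distribution; the parameters (`mu=11.2`, `sigma=0.4`) and the seed are assumptions for illustration and are not fitted to the H-1B data.

```python
import math
import random
from statistics import mean, median, stdev

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic right-skewed wages; parameters are illustrative, not estimated from the data.
wages = [rng.lognormvariate(11.2, 0.4) for _ in range(5000)]
log_wages = [math.log(w) for w in wages]

sd_raw = stdev(wages)
sd_log = stdev(log_wages)

print("mean vs median (raw):", round(mean(wages)), round(median(wages)))
print("sd raw:", round(sd_raw), "sd log:", round(sd_log, 3))
```

On the raw scale the mean sits above the median and the standard deviation is in the tens of thousands, while on the log scale the spread is well below 1, which is the scale on which the later power analysis works.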
Table 3
The standard deviation for each group
| Area | Assistant Professor | Attorney | Business Analyst | Data Analyst | Data Scientist | Management | Software Engineer | Teacher |
|---|---|---|---|---|---|---|---|---|
| MIDWEST | 0.49 | 0.40 | 0.23 | 0.21 | 0.24 | 0.43 | 0.20 | 0.23 |
| NORTHEAST | 0.44 | 0.48 | 0.23 | 0.25 | 0.33 | 0.40 | 0.25 | 0.38 |
| SOUTHEAST | 0.51 | 0.42 | 0.21 | 0.18 | 0.25 | 0.38 | 0.21 | 0.22 |
| SOUTHWEST | 0.52 | 0.47 | 0.24 | 0.22 | 0.30 | 0.42 | 0.27 | 0.19 |
| WEST | 0.41 | 0.42 | 0.24 | 0.27 | 0.22 | 0.38 | 0.25 | 0.28 |
Figure 6. The plot of power against sample size. From this plot, we can see that the power ranges from 0.3 to 1 across the sample sizes. If we want high power for our test, we can choose a sample size of around 2000. When the sample size is larger than 1600, the power is greater than 0.9.
Table 4
The GLMPOWER Procedure
Table 5
The GLMPOWER Procedure
Based on these two tables, we can conclude that the power is very high, close to 1, when the sample size is around 2000.
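The way power rises with sample size can be sketched with a much simpler calculation than the GLMPOWER procedure. The example below is not the two-way ANOVA power computation reported above: it is a normal approximation for comparing two group means, using the standard deviation of 0.5 from the text and a hypothetical mean difference `DELTA` that does not come from the source.

```python
import math
from statistics import NormalDist


def approx_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test for a mean difference delta."""
    se = sd * math.sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability of exceeding the upper critical value; the opposite tail is ignored.
    return NormalDist().cdf(delta / se - z_crit)


SD = 0.5     # standard deviation chosen in the text (log scale)
DELTA = 0.1  # hypothetical difference in mean log wage; an assumption for illustration

powers = {n: approx_power(DELTA, SD, n) for n in (50, 200, 400, 800)}
for n, p in powers.items():
    print(n, round(p, 3))
```

The numbers will not match Figure 6, because the design and effect size differ, but the shape is the same: power climbs quickly at first and then flattens as it approaches 1.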
Now that we know how many samples we need, we can draw the sample. I use simple random sampling within each cell: I select 50 samples for each job title in each area. The total sample size is 1750. Rechecking Figure 6, we still get high power with 1750 samples. Because the sample size is the same in every cell, we can build a balanced two-way ANOVA model.
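Drawing the same number of records from every area and job combination can be sketched as follows. The record layout, the `sample_per_cell` helper, and the placeholder job names are assumptions; seven job groups are used only so that 5 areas x 7 jobs x 50 records reproduces the stated total of 1750.

```python
import random
from collections import Counter, defaultdict


def sample_per_cell(records, per_cell, seed=0):
    """Draw a simple random sample of per_cell records from every (area, job) cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["area"], rec["job"])].append(rec)
    sample = []
    for key in sorted(cells):
        # random.sample draws without replacement within the cell.
        sample.extend(rng.sample(cells[key], per_cell))
    return sample


areas = ["MIDWEST", "NORTHEAST", "SOUTHEAST", "SOUTHWEST", "WEST"]
jobs = [f"job_{i}" for i in range(7)]  # placeholder names for seven job groups

# Synthetic population: 200 made-up records per cell.
records = [
    {"area": a, "job": j, "wage": 50_000 + k}
    for a in areas for j in jobs for k in range(200)
]

sample = sample_per_cell(records, per_cell=50)
cell_sizes = Counter((r["area"], r["job"]) for r in sample)
print(len(sample), set(cell_sizes.values()))
```

Equal cell sizes are what make the design balanced, which keeps the sums of squares in the two-way ANOVA cleanly separable.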
The ANOVA model has two assumptions.
We can check normality with the following histogram.
Figure 7. The histogram of residuals. The distribution is bell-shaped and looks normal.
Figure 8. The Q-Q plot of residuals. There are some outliers at the low and high ends of the x range, but most of the points are close to the straight line. We conclude that the residuals are approximately normal.
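The points of a normal Q-Q plot can be computed by pairing each sorted residual with the standard normal quantile at the same plotting position. The sketch below uses synthetic residuals, and the plotting position `(i + 0.5) / n` is one common convention chosen here as an assumption. The correlation of the points gives a rough numeric summary of how straight the plot is.

```python
import random
from statistics import NormalDist, mean


def qq_points(residuals):
    """Pair each sorted residual with the standard normal quantile at its plotting position."""
    n = len(residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(residuals)))


def pearson(points):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
residuals = [rng.gauss(0, 0.3) for _ in range(1750)]  # synthetic stand-in for model residuals

points = qq_points(residuals)
r = pearson(points)
print(round(r, 4))
```

A correlation very close to 1 corresponds to points lying near the straight line; heavy tails would pull the extreme points away from it and lower the value.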
Then, we need to test the equality of variances with Levene’s test, using the following hypotheses.
H0: The variances are equal. vs. Ha: The variances are not equal.
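For reference, the Levene statistic itself is straightforward to compute: it is a one-way ANOVA F statistic applied to the absolute deviations of each observation from its group center. The sketch below assumes the mean-centered variant (the source does not say which variant was used) and returns only the statistic, since a p-value would require the F distribution.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W statistic using absolute deviations from each group mean."""
    z = []
    for g in groups:
        center = mean(g)
        z.append([abs(y - center) for y in g])
    n_total = sum(len(g) for g in z)
    k = len(z)
    z_means = [mean(g) for g in z]
    z_grand = mean(v for g in z for v in g)
    between = sum(len(g) * (m - z_grand) ** 2 for g, m in zip(z, z_means))
    within = sum((v - m) ** 2 for g, m in zip(z, z_means) for v in g)
    return (n_total - k) / (k - 1) * between / within


# Made-up groups: the second has twice the spread of the first.
w = levene_w([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])
print(round(w, 4))
```

Under H0, W is compared with an F distribution with k - 1 and N - k degrees of freedom, where k is the number of groups and N the total sample size.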
Table 6
Levene’s test result.
Note. The small p-value suggests that we reject H0.
The normality assumption is satisfied; however, the equal-variance assumption is not.
Table 7
ANOVA table.
Note. The small p-value for the interaction term suggests that we reject H0: no interaction exists, in favor of Ha: an interaction exists. This model has a significant interaction, so we do not interpret the main effects on their own.
This table shows the mean for each job in each area. We want to compare the cell means, so we sort them by AREA and compare the cell means within each area.
Job titles in the same color are in the same group, and their means are not significantly different. In every area, the teacher mean differs significantly from those of the other job titles. Attorney and management consultant salaries are the top two in each area. Now, we can compare each job across the different areas.
From these comparisons, we find that the Business Analyst salary shows no significant difference across areas. For Teacher, Software Engineer, Management Consultant, data-related jobs, and Business Analyst, the salary is highest in the WEST area.
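The interaction F statistic in a balanced two-way ANOVA comes from splitting the total variation into two main effects, an interaction, and an error term. The following sketch works through that decomposition on a tiny made-up 2 x 2 data set; the function name and the data are illustrative and do not reproduce the ANOVA table above.

```python
from statistics import mean


def balanced_two_way_anova(cells):
    """Sums of squares for a balanced design; cells maps (a, b) -> equal-length lists."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # observations per cell
    grand = mean(y for ys in cells.values() for y in ys)
    a_mean = {a: mean(y for b in b_levels for y in cells[(a, b)]) for a in a_levels}
    b_mean = {b: mean(y for a in a_levels for y in cells[(a, b)]) for b in b_levels}
    cell_mean = {key: mean(ys) for key, ys in cells.items()}

    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels for b in b_levels
    )
    ss_err = sum((y - cell_mean[key]) ** 2 for key, ys in cells.items() for y in ys)

    df_ab = (len(a_levels) - 1) * (len(b_levels) - 1)
    df_err = len(a_levels) * len(b_levels) * (n - 1)
    f_interaction = (ss_ab / df_ab) / (ss_err / df_err)
    return {"ss_a": ss_a, "ss_b": ss_b, "ss_ab": ss_ab, "ss_err": ss_err,
            "f_interaction": f_interaction}


# Made-up data: the (A2, B2) cell is higher than an additive model would predict.
cells = {
    ("A1", "B1"): [1, 3], ("A1", "B2"): [4, 6],
    ("A2", "B1"): [5, 7], ("A2", "B2"): [12, 14],
}
result = balanced_two_way_anova(cells)
print(result)
```

When the cell means are purely additive, the interaction sum of squares is zero; a large interaction F means the effect of one factor depends on the level of the other, which is why the cell means are compared within each area rather than through the main effects.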
Having finished the salary analysis, we turn to the difference between Paid Wage Per Year and Prevailing Wage Per Year. As I mentioned before, I created a new variable, DIFF. We first check its distribution.
Figure 9. The distribution of the difference. From this plot, we can see that most of the data is concentrated around 0 and that the distribution is right-skewed.
Figure 10. The scatter plot of the difference for each job in each area. Most of the data lies between 0 and 150000. Only 14 data points are below 0 (the red line). Assistant Professor has a few extremely high values in the Midwest, Northeast, and Southeast. Also, in each area, Assistant Professor has some points below 0.
For each job in each area, the range also varies from very small to very large. The difference between paid wage and prevailing wage has a small range for business analyst, management consultant, data-related jobs, software engineer, and teacher. Compared with these jobs, assistant professor and attorney show much greater variability. The gray shading represents the 95% confidence interval (CI) for the mean. In this plot, the CIs for assistant professor, attorney, and management consultant are larger than those of the other jobs in each area. More details about the CIs are given in Table 9.
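A minimal sketch of how DIFF can be derived, assuming it is the paid wage minus the prevailing wage. The three rows reuse the first wage values shown in the data listing earlier; the `add_diff` helper and the choice to leave DIFF empty when the prevailing wage is missing are assumptions.

```python
def add_diff(rows):
    """Return copies of rows with DIFF = paid wage - prevailing wage (None if either is missing)."""
    out = []
    for row in rows:
        paid = row["PAID_WAGE_PER_YEAR"]
        prevailing = row["PREVAILING_WAGE_PER_YEAR"]
        diff = None if paid is None or prevailing is None else paid - prevailing
        out.append({**row, "DIFF": diff})
    return out


rows = [
    {"PAID_WAGE_PER_YEAR": 91440.00, "PREVAILING_WAGE_PER_YEAR": None},
    {"PAID_WAGE_PER_YEAR": 170000.00, "PREVAILING_WAGE_PER_YEAR": None},
    {"PAID_WAGE_PER_YEAR": 77952.00, "PREVAILING_WAGE_PER_YEAR": 27934.0},
]

with_diff = add_diff(rows)
diffs = [r["DIFF"] for r in with_diff if r["DIFF"] is not None]
print(diffs)
```

A positive DIFF means the employer offered more than the prevailing wage, and a negative value means less.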
For DIFF, I have the following hypotheses.
H0: the mean of DIFF for each job in each area > 0 vs. Ha: the mean of DIFF for each job in each area <= 0
The sample size for each cell is 50, so, based on the Central Limit Theorem, I use a one-sided t-test for this hypothesis test.
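The test statistic for one cell can be sketched as below. The data are synthetic, and the p-value uses a normal approximation to the t distribution (reasonable for 50 observations) because the Python standard library has no t distribution; R's own t-test would use the exact t reference distribution instead.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def one_sample_t(values, mu0=0.0):
    """t statistic for testing the mean of values against mu0."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / math.sqrt(n))


def lower_tail_p(t_stat):
    """Approximate P(T <= t) with the standard normal; a large-sample approximation."""
    return NormalDist().cdf(t_stat)


rng = random.Random(1)
# Synthetic DIFF values for one cell of 50 records; not taken from the data.
cell_diff = [rng.gauss(20_000, 15_000) for _ in range(50)]

t_stat = one_sample_t(cell_diff)
p_lower = lower_tail_p(t_stat)
print(round(t_stat, 2), round(p_lower, 4))
```

With the alternative stated as "mean <= 0", small values of the statistic count against H0, so a large positive t gives a lower-tail p-value near 1 and H0 is not rejected.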
Based on the p-values given by R, we cannot reject H0 for any job in any area. This is consistent with the mean difference being larger than 0 in every cell, which suggests that companies are willing to pay non-US employees more than the prevailing wage for technical jobs. Next, we check the 95% confidence interval for each cell mean.
Table 9
Note. In this table, we can compare the CIs within each area. The width of the CI differs from job to job. The confidence interval for Assistant Professor is the widest in four areas: Midwest, Northeast, Southeast, and Southwest. The teacher’s CI is the narrowest in each area.
For H-1B visa holders, salary differs by area and by job. The two-way ANOVA model shows that the interaction between AREA and JOB TITLE SUBGROUP is significant for H-1B salaries. Within each area, the means of some jobs show no significant difference, so we can treat those jobs as one group, while others differ significantly from the rest. We also find that, in every area, companies are willing to pay non-US employees more than the prevailing wage. This indicates that non-US employees can earn a decent salary in a technical job.
Although this paper produced some useful results about H-1B holders’ salaries, we could consider more variables when building a future model, such as age, sex, and education level. With more variables, we might build a MANOVA model to check whether interactions or main effects exist. In this paper, I use simple random sampling to avoid the problem of statistical versus practical significance. In future research, more advanced statistical methods or models might allow us to analyze the raw data directly and obtain better results.