Is your name related to how much you get paid? How much more is person AAA paid compared to ZZZ?
Salary is NOT correlated with the alphabetical order of one’s last name. That is, Professor ABC gets paid about the same as Professor XYZ.
Male and female workers are paid the same for the same job.
The correlation between the alphabetical order of one’s last name and one’s salary is positive and significant. Compared to employees whose last names start with A to E, employees with other last names get paid less, around $900 to $1,500 less per year.
Female professors get paid about $7,300 less per year than males after controlling for job title, employer, and the alphabetical order of the last name.
Compared to assistant professor, the salary premium of the Associate Professor and Professor is $12,500 and $36,500 per year, respectively.
More details and my R code follow.
Ontario Public sector salary disclosure: individual level data includes:
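library(tidyverse)   # read_rds(), %>%, mutate(), case_when(), ggplot()
library(stargazer)   # regression tables
library(lmtest)      # dwtest()
d<-read_rds("pssd_1996_2017.rds")
d$Calendar.Year<-as.factor(d$Calendar.Year)
names(d)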
## [1] "Sector" "Last.Name" "First.Name"
## [4] "Salary.Paid" "Taxable.Benefits" "Employer"
## [7] "Job.Title" "Calendar.Year"
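The data for each year are downloaded and combined. I also generated new variables: Last.Name.start holds the initial of the surname, and Last_Name is a factor indicating whether that initial falls in A-E, F-J, K-O, P-T, or T-Z (First.Name.start and First_Name are built the same way from the first name). Because T is already assigned to P-T, the group labelled T-Z effectively covers U to Z.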
# Extract the initials, keep only records whose initials are capital letters A-Z,
# then bucket each initial into one of five alphabetical groups.
d<-d %>%
  mutate(First.Name.start=substr(First.Name,1,1),
         Last.Name.start=substr(Last.Name,1,1)) %>%
  filter(First.Name.start %in% c(LETTERS) &
           Last.Name.start %in% c(LETTERS)) %>%
  # case_when() uses the first matching rule, so "T" lands in "P-T"
  # and the bucket labelled "T-Z" effectively holds U to Z.
  mutate(First_Name=case_when(First.Name.start %in% LETTERS[1:5] ~ "A-E",
                              First.Name.start %in% LETTERS[6:10] ~ "F-J",
                              First.Name.start %in% LETTERS[11:15] ~ "K-O",
                              First.Name.start %in% LETTERS[16:20] ~ "P-T",
                              First.Name.start %in% LETTERS[20:26] ~ "T-Z"
  )) %>%
  mutate(Last_Name=case_when(Last.Name.start %in% LETTERS[1:5] ~ "A-E",
                             Last.Name.start %in% LETTERS[6:10] ~ "F-J",
                             Last.Name.start %in% LETTERS[11:15] ~ "K-O",
                             Last.Name.start %in% LETTERS[16:20] ~ "P-T",
                             Last.Name.start %in% LETTERS[20:26] ~ "T-Z"
  ))
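The same kind of grouping can be sketched in base R with `cut()`. This is only an illustration: the helper name `initial_group` and the sample surnames are made up, and the last bucket is labelled U-Z here because T already belongs to P-T.

```r
# Hypothetical helper (not part of the original analysis):
# map a name to one of five alphabetical buckets by its first letter.
initial_group <- function(name) {
  # Position of the initial in the alphabet: A = 1, ..., Z = 26 (NA if not a letter)
  pos <- match(toupper(substr(name, 1, 1)), LETTERS)
  cut(pos,
      breaks = c(0, 5, 10, 15, 20, 26),
      labels = c("A-E", "F-J", "K-O", "P-T", "U-Z"))
}

# Made-up surnames, for illustration only
groups <- initial_group(c("Adams", "Gupta", "Shang", "Turner", "Xu"))
print(as.character(groups))
```

Unlike the `case_when()` version, non-overlapping breaks make it explicit that each letter belongs to exactly one bucket.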
This preliminary analysis is on individuals with a job title Assistant Professor, Associate Professor, or Professor.
Here are the regression results.
Last names starting with A to E form the reference group. Compared to this group, people with other last names (later in alphabetical order) get paid less, although the difference for the P-T group is not statistically significant.
Column (2) adds the initial of the first name as an explanatory variable. It is not significant, but the initial of the last name remains significant.
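# Keep only the three professor ranks
d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
# Model (1): last-name group plus job title, employer, and year controls
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
           Job.Title+
           Employer+
           Calendar.Year
)
# Model (2): same as (1) plus the first-name group
md.add.first.name <- lm(data = d.sub,
                        Salary.Paid ~ Last_Name+
                          First_Name+
                          Job.Title+
                          Employer+
                          Calendar.Year)
# Employer dummies are estimated but hidden from the table
stargazer(md,md.add.first.name,
          omit = "Employer",
          omit.labels = "Employer",
          intercept.bottom = FALSE,
          df = FALSE,
          type="html")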
| Dependent variable: | ||
| Salary.Paid | ||
| (1) | (2) | |
| Constant | 58,902.860*** | 58,818.970*** |
| (3,413.797) | (3,416.030) | |
| Last_NameF-J | -1,386.548*** | -1,385.900*** |
| (227.773) | (227.823) | |
| Last_NameK-O | -828.753*** | -829.437*** |
| (207.007) | (207.074) | |
| Last_NameP-T | -322.248 | -320.559 |
| (211.311) | (211.362) | |
| Last_NameT-Z | -1,108.991*** | -1,106.210*** |
| (276.599) | (276.812) | |
| First_NameF-J | | 120.286 |
| | (211.740) | |
| First_NameK-O | | 28.464 |
| | (209.901) | |
| First_NameP-T | | 206.241 |
| | (204.896) | |
| First_NameT-Z | | -16.824 |
| | (341.493) | |
| Job.TitleAssociate Professor | 11,846.280*** | 11,844.550*** |
| (276.308) | (276.334) | |
| Job.TitleProfessor | 34,952.760*** | 34,947.670*** |
| (286.153) | (286.279) | |
| Calendar.Year1998 | 4,219.026 | 4,233.896 |
| (3,049.475) | (3,049.597) | |
| Calendar.Year1999 | 3,905.832 | 3,919.749 |
| (2,997.917) | (2,998.025) | |
| Calendar.Year2000 | 4,869.895* | 4,882.420* |
| (2,914.033) | (2,914.176) | |
| Calendar.Year2001 | 6,515.093** | 6,524.331** |
| (2,911.235) | (2,911.369) | |
| Calendar.Year2002 | 7,750.415*** | 7,759.718*** |
| (2,882.859) | (2,882.988) | |
| Calendar.Year2003 | 12,559.320*** | 12,568.870*** |
| (2,848.803) | (2,848.949) | |
| Calendar.Year2004 | 7,839.397*** | 7,852.529*** |
| (2,847.962) | (2,848.176) | |
| Calendar.Year2005 | 19,297.630*** | 19,311.460*** |
| (2,826.526) | (2,826.723) | |
| Calendar.Year2006 | 21,915.240*** | 21,929.960*** |
| (2,826.652) | (2,826.863) | |
| Calendar.Year2007 | 28,603.920*** | 28,617.230*** |
| (2,819.952) | (2,820.166) | |
| Calendar.Year2008 | 32,577.590*** | 32,590.420*** |
| (2,816.833) | (2,817.037) | |
| Calendar.Year2009 | 37,890.150*** | 37,904.000*** |
| (2,812.323) | (2,812.534) | |
| Calendar.Year2010 | 41,064.560*** | 41,077.630*** |
| (2,810.306) | (2,810.519) | |
| Calendar.Year2011 | 43,725.490*** | 43,739.210*** |
| (2,807.960) | (2,808.177) | |
| Calendar.Year2012 | 46,419.780*** | 46,433.590*** |
| (2,807.480) | (2,807.701) | |
| Calendar.Year2013 | 48,718.290*** | 48,732.480*** |
| (2,808.908) | (2,809.139) | |
| Calendar.Year2014 | 51,732.380*** | 51,746.830*** |
| (2,807.829) | (2,808.066) | |
| Calendar.Year2015 | 53,331.590*** | 53,346.510*** |
| (2,805.817) | (2,806.055) | |
| Calendar.Year2016 | 56,443.620*** | 56,458.290*** |
| (2,805.224) | (2,805.464) | |
| Calendar.Year2017 | 59,664.220*** | 59,679.830*** |
| (2,808.626) | (2,808.864) | |
| Employer | Yes | Yes |
| Observations | 94,051 | 94,051 |
| R2 | 0.463 | 0.463 |
| Adjusted R2 | 0.462 | 0.462 |
| Residual Std. Error | 22,465.490 | 22,465.810 |
| F Statistic | 715.326*** | 690.862*** |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
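To see what "reference group" means in a table like this, here is a minimal sketch on simulated data (the numbers are invented and have nothing to do with the Ontario data). With a single factor regressor, `lm()` absorbs the first factor level into the intercept, and each dummy coefficient is that group's mean minus the reference group's mean.

```r
set.seed(1)

# Simulated salaries for three made-up last-name groups
toy <- data.frame(
  group  = factor(rep(c("A-E", "F-J", "K-O"), each = 50)),
  salary = c(rnorm(50, 110000, 5000),
             rnorm(50, 108500, 5000),
             rnorm(50, 109000, 5000))
)

fit <- lm(salary ~ group, data = toy)

# "A-E" is the first level, so it is the reference group and has no row of its own
print(coef(fit))

# Group means, to compare with the coefficients
means <- tapply(toy$salary, toy$group, mean)
print(means - means["A-E"])
```

In the full models above the coefficients are additionally adjusted for job title, employer, and year, so they are no longer simple differences in means, but the reading is the same: every Last_Name coefficient is relative to A-E.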
The normal Q-Q plot is not linear, so the residuals do not appear to follow a normal distribution.
par(mfrow = c(2,2))
plot(md)
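The Durbin-Watson statistic (1.89) is close to 2, so any first-order autocorrelation in the residuals is mild. Note, however, that with 94,051 observations the test still rejects the null hypothesis of no autocorrelation (p-value < 2.2e-16), so this assumption is only approximately satisfied.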
dwtest(md)
##
## Durbin-Watson test
##
## data: md
## DW = 1.8915, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
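For intuition about the number reported by `dwtest()`, the Durbin-Watson statistic can be computed by hand from the residuals. The sketch below uses simulated data (not the salary data) and base R only. The statistic is roughly 2(1 - r), where r is the correlation between consecutive residuals, so it depends on the order of the rows.

```r
set.seed(42)

# Simulated regression with independent errors
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
e <- residuals(lm(y ~ x))

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)

# Lag-1 correlation of the residuals, for comparison with 2 * (1 - r)
r <- cor(e[-1], e[-length(e)])
print(2 * (1 - r))
```

Values near 2 indicate little first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.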
Here is some related research on the last-name premium:
van Praag, M., & van Praag, B. M. (2004). First-Author Determinants and the Benefits of Being Professor A (and not Z).
“We find that author groups clearly account for the relative individual costs and benefits of being a first author when positioning authors of a group: Economists perceive name ordering as a deliberate decision. The usefulness of this deliberation is supported by the answer to our second question: is scientific output higher for authors whose names rank first in the alphabet? Career prospects are indeed better for academic economists who have high chances to be a first author. This productivity effect does not become visible at the beginning of one’s career: it turns out to be a reputation or visibility effect and it therefore only shows up when an economist’s career is more advanced.”
Efthyvoulou, G. (2008). Alphabet economics: The link between names and reputation. The Journal of Socio-Economics, 37(3), 1266-1285.
“Overall, we find that faculty members with earlier last name initials are more likely to get employment at high standard research departments. Furthermore, we show that the relationship between alphabetical placement and academic success remains significant if we use as an alternative measure of reputation the number of people showing an interest in the papers of a particular academic.”
The previous results do not adjust salary by CPI. I will do it here to see if that affects the main findings. All salaries are adjusted to the 2017 level.
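# CPI inflation by quarter; the year is the first four characters of Quarter
cpi<-read.csv("CanadaCPI.csv")
cpi$year<-as.numeric(substr(cpi$Quarter,1,4))
# Average the rates within each year (assumes the file covers exactly 1993 to 2017)
d.cpi<-data.frame(year=1993:2017,CPI=c(by(cpi$CPI.Inflation,cpi$year,mean)))
d.cpi$pct<-(d.cpi$CPI/100+1)
# Cumulative factor that converts each year's dollars to 2017 dollars (2017 itself stays at 1)
d.cpi$to2017<-1
for (i in 1993:2016) {
  d.cpi$to2017[d.cpi$year==i] <- prod(d.cpi$pct[d.cpi$year %in% i:2017])
}
# Attach the factor to every salary record and rescale the dollar amounts
d.merge<-merge(x = d,y = d.cpi,by.x ="Calendar.Year", by.y = "year",all.x = T )
d<-mutate(d.merge,
          Salary.Paid=Salary.Paid*to2017,
          Taxable.Benefits=Taxable.Benefits*to2017
)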
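Here is a small worked example of the compounding step, using made-up inflation rates (not the real Canadian CPI series). The `to2017` column follows the convention in the code above, which compounds every rate from the salary year through 2017. If the rate for year y measures the price change from y-1 to y, the product would start at y+1 instead (`to2017_alt`); which one applies depends on how CanadaCPI.csv defines its series, so treat the choice as an assumption.

```r
# Made-up annual inflation rates in percent, for illustration only
toy <- data.frame(year = 2014:2017, inflation = c(2.0, 1.0, 1.5, 1.6))
toy$pct <- 1 + toy$inflation / 100

# Product of the growth factors from `start` through 2017
# (prod() of an empty vector is 1, which handles start = 2018)
compound_from <- function(start) prod(toy$pct[toy$year >= start])

# Convention used above: include the salary year's own rate, 2017 fixed at 1
toy$to2017 <- sapply(toy$year, function(y) if (y == 2017) 1 else compound_from(y))

# Alternative convention: only the rates after the salary year
toy$to2017_alt <- sapply(toy$year, function(y) compound_from(y + 1))

print(toy)

# A 100,000 salary paid in 2016, expressed in 2017 dollars under each convention
print(100000 * toy$to2017[toy$year == 2016])
print(100000 * toy$to2017_alt[toy$year == 2016])
```

The two conventions differ by one year of inflation for every year before 2017, so the choice shifts the year effects slightly but should not affect comparisons between people within the same year.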
Here are the results with inflation-adjusted salaries.
# Same two models as before, now on the inflation-adjusted salaries
d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
           Job.Title+
           Employer+
           Calendar.Year
)
md.add.first.name <- lm(data = d.sub,
                        Salary.Paid ~ Last_Name+
                          First_Name+
                          Job.Title+
                          Employer+
                          Calendar.Year)
stargazer(md,md.add.first.name,
          omit = "Employer",
          omit.labels = "Employer",
          intercept.bottom = FALSE,
          df = FALSE,
          type="html")
| Dependent variable: | ||
| Salary.Paid | ||
| (1) | (2) | |
| Constant | 108,605.600*** | 108,536.000*** |
| (3,703.862) | (3,706.282) | |
| Last_NameF-J | -1,492.545*** | -1,491.028*** |
| (247.126) | (247.180) | |
| Last_NameK-O | -886.264*** | -885.580*** |
| (224.596) | (224.668) | |
| Last_NameP-T | -342.091 | -339.603 |
| (229.265) | (229.321) | |
| Last_NameT-Z | -1,223.407*** | -1,218.265*** |
| (300.101) | (300.332) | |
| First_NameF-J | | 124.782 |
| | (229.731) | |
| First_NameK-O | | -19.573 |
| | (227.736) | |
| First_NameP-T | | 190.556 |
| | (222.306) | |
| First_NameT-Z | | -107.331 |
| | (370.509) | |
| Job.TitleAssociate Professor | 12,431.790*** | 12,429.340*** |
| (299.786) | (299.813) | |
| Job.TitleProfessor | 37,403.640*** | 37,397.000*** |
| (310.467) | (310.603) | |
| Calendar.Year1998 | 50.616 | 67.260 |
| (3,308.584) | (3,308.714) | |
| Calendar.Year1999 | -1,821.808 | -1,806.956 |
| (3,252.645) | (3,252.761) | |
| Calendar.Year2000 | -3,309.803 | -3,296.403 |
| (3,161.633) | (3,161.788) | |
| Calendar.Year2001 | -5,453.563* | -5,442.926* |
| (3,158.598) | (3,158.742) | |
| Calendar.Year2002 | -7,664.755** | -7,654.777** |
| (3,127.811) | (3,127.949) | |
| Calendar.Year2003 | -5,211.303* | -5,200.639* |
| (3,090.861) | (3,091.018) | |
| Calendar.Year2004 | -13,498.750*** | -13,484.050*** |
| (3,089.949) | (3,090.179) | |
| Calendar.Year2005 | -3,682.918 | -3,667.822 |
| (3,066.692) | (3,066.904) | |
| Calendar.Year2006 | -3,756.728 | -3,740.102 |
| (3,066.829) | (3,067.055) | |
| Calendar.Year2007 | 959.974 | 975.481 |
| (3,059.558) | (3,059.789) | |
| Calendar.Year2008 | 2,277.027 | 2,291.945 |
| (3,056.175) | (3,056.395) | |
| Calendar.Year2009 | 4,521.858 | 4,538.341 |
| (3,051.282) | (3,051.509) | |
| Calendar.Year2010 | 7,640.857** | 7,656.226** |
| (3,049.093) | (3,049.322) | |
| Calendar.Year2011 | 7,848.134*** | 7,864.311*** |
| (3,046.548) | (3,046.781) | |
| Calendar.Year2012 | 6,795.538** | 6,811.963** |
| (3,046.027) | (3,046.265) | |
| Calendar.Year2013 | 7,100.251** | 7,117.161** |
| (3,047.576) | (3,047.825) | |
| Calendar.Year2014 | 9,068.066*** | 9,085.208*** |
| (3,046.406) | (3,046.661) | |
| Calendar.Year2015 | 8,043.251*** | 8,061.040*** |
| (3,044.223) | (3,044.479) | |
| Calendar.Year2016 | 9,684.371*** | 9,701.781*** |
| (3,043.579) | (3,043.839) | |
| Calendar.Year2017 | 8,131.262*** | 8,149.733*** |
| (3,047.270) | (3,047.527) | |
| Employer | Yes | Yes |
| Observations | 94,051 | 94,051 |
| R2 | 0.418 | 0.418 |
| Adjusted R2 | 0.417 | 0.417 |
| Residual Std. Error | 24,374.350 | 24,374.680 |
| F Statistic | 596.978*** | 576.565*** |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
It would be interesting to see how gender is related to salary. The Ontario public sector salary data do not record gender, but they do include names. So I matched the first names with the top 100 male names and female names in Canada according to this website. For names not in the top 100 lists, the gender variable is “Gender Unknown”. I hope there are not many males using female names or vice versa.
# Top 100 boy and girl names, lower-cased so matching is case-insensitive
top100names<-read.csv("top100names.csv",colClasses = c("integer","character","character"))
top100names<-top100names %>%
  mutate(Boy = tolower(Boy)) %>%
  mutate(Girl = tolower(Girl))
set.top.male.name<-c(top100names$Boy)
set.top.female.name<-c(top100names$Girl)
# Flag each record; the male list is checked first, so a name that
# appears on both lists is classified as "Male"
d <-d %>%
  mutate(First.Name = tolower(First.Name)) %>%
  mutate(Last.Name = tolower(Last.Name)) %>%
  mutate(top_100_male_name = case_when(First.Name %in% set.top.male.name~ "YES",
                                       TRUE ~ "NO")) %>%
  mutate(top_100_female_name = case_when(First.Name %in% set.top.female.name ~ "YES",
                                         TRUE ~ "NO")) %>%
  mutate(Gender = case_when(top_100_male_name == "YES" ~ "Male",
                            top_100_female_name == "YES" ~ "Female",
                            TRUE ~ "Gender Unknown"))
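A tiny base R sketch of the same matching logic shows how names fall into the three categories. The name lists and first names below are invented stand-ins, not the actual top 100 lists.

```r
# Made-up stand-ins for the top-100 lists (illustration only)
male_names   <- c("james", "jordan", "robert")
female_names <- c("mary", "jordan", "linda")

# Made-up first names to classify
first_names <- c("James", "Mary", "Jordan", "Xiaoming")
fn <- tolower(first_names)

# Male list is checked first, then female; everything else is unknown
gender <- ifelse(fn %in% male_names, "Male",
                 ifelse(fn %in% female_names, "Female", "Gender Unknown"))

print(data.frame(first_names, gender))
print(table(gender))
```

Two things are worth checking on the real data with a table like this: how large the "Gender Unknown" group is, and how many names sit on both lists, since those are all counted as male.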
Let’s plot the data.
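d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
# Salary by job title
ggplot(d.sub,aes(Job.Title,Salary.Paid))+
  geom_boxplot()
# Salary by inferred gender
ggplot(d.sub,aes(Gender,Salary.Paid))+
  geom_boxplot()
# Calendar.Year is a factor: convert via as.character() to get the actual
# years rather than the internal level codes (1, 2, 3, ...)
d.sub$Calendar.Year<-as.numeric(as.character(d.sub$Calendar.Year))
# Salary over time by gender, one panel per job title
ggplot(d.sub,aes(Calendar.Year,Salary.Paid,color=Gender))+
  geom_point(alpha=0.05)+
  stat_smooth(size=1)+
  facet_grid(~Job.Title)+
  theme(legend.position="bottom")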
Here are the regression results. Column (3) adds gender, with female as the reference group. The results suggest that males are paid $7,315 more than females after controlling for the alphabetical order of last names, job title, employer, and year. Of course, these results are preliminary, and the model explains only 41% of the variation in salary. Are women doing the same job as men but getting paid less? If so, then from an employer’s point of view it would pay to hire more women, which would raise the demand for female workers. As employers compete for female workers by offering higher wages, female salaries would be pushed up. So what is causing this $7,315 annual salary gap between men and women?
| Salary.Paid | |||
| (1) | (2) | (3) | |
| Constant | 98,243.520*** | 98,164.910*** | 92,131.070*** |
| (2,168.682) | (2,171.705) | (2,245.461) | |
| Last_NameF-J | -1,498.886*** | -1,498.243*** | -1,511.285*** |
| (248.313) | (248.367) | (248.232) | |
| Last_NameK-O | -904.142*** | -904.453*** | -897.833*** |
| (225.684) | (225.757) | (225.616) | |
| Last_NameP-T | -358.774 | -356.869 | -379.615* |
| (230.371) | (230.426) | (230.289) | |
| Last_NameT-Z | -1,229.256*** | -1,225.713*** | -1,248.908*** |
| (301.554) | (301.786) | (301.622) | |
| Job.TitleAssociate Professor | 12,415.990*** | 12,413.560*** | 12,413.890*** |
| (301.182) | (301.209) | (300.993) | |
| Job.TitleProfessor | 36,547.560*** | 36,540.660*** | 36,489.230*** |
| (310.130) | (310.272) | (310.084) | |
| First_NameF-J | | 162.720 | -50.453 |
| | (230.824) | (231.382) | |
| First_NameK-O | | 11.193 | -43.516 |
| | (228.834) | (229.344) | |
| First_NameP-T | | 212.680 | 220.567 |
| | (223.380) | (225.648) | |
| First_NameT-Z | | -62.995 | -230.031 |
| | (372.294) | (372.302) | |
| GenderGender Unknown | | | 5,798.240*** |
| | | (585.245) | |
| GenderMale | | | 7,315.277*** |
| | | (629.987) | |
| Year Dummies? | Yes | Yes | Yes |
| Employer Dummies? | Yes | Yes | Yes |
| Observations | 94,051 | 94,051 | 94,051 |
| R2 | 0.412 | 0.412 | 0.413 |
| Adjusted R2 | 0.412 | 0.412 | 0.412 |
| Residual Std. Error | 24,493.210 | 24,493.540 | 24,475.970 |
| F Statistic | 700.818*** | 672.210*** | 661.081*** |
| Notes: | ***Significant at the 1 percent level. | ||
| **Significant at the 5 percent level. | |||
| *Significant at the 10 percent level. | |||
One of my favorite economists Professor Marc F. Bellemare suggests
one dummy for each letter and see if it’s generally monotonic.
Great suggestion! Let’s generate one dummy for each last name initial and repeat our main model. The results are reported in column (4).
| Salary.Paid | ||||
| (1) | (2) | (3) | (4) | |
| Constant | 98,243.520*** | 98,164.910*** | 92,131.070*** | 90,970.680*** |
| (2,168.682) | (2,171.705) | (2,245.461) | (2,279.831) | |
| Last_NameF-J | -1,498.886*** | -1,498.243*** | -1,511.285*** | |
| (248.313) | (248.367) | (248.232) | ||
| Last_NameK-O | -904.142*** | -904.453*** | -897.833*** | |
| (225.684) | (225.757) | (225.616) | ||
| Last_NameP-T | -358.774 | -356.869 | -379.615* | |
| (230.371) | (230.426) | (230.289) | ||
| Last_NameT-Z | -1,229.256*** | -1,225.713*** | -1,248.908*** | |
| (301.554) | (301.786) | (301.622) | ||
| Job.TitleAssociate Professor | 12,415.990*** | 12,413.560*** | 12,413.890*** | 12,413.980*** |
| (301.182) | (301.209) | (300.993) | (300.847) | |
| Job.TitleProfessor | 36,547.560*** | 36,540.660*** | 36,489.230*** | 36,495.860*** |
| (310.130) | (310.272) | (310.084) | (309.965) | |
| First_NameF-J | 162.720 | -50.453 | -20.195 | |
| (230.824) | (231.382) | (231.421) | ||
| First_NameK-O | 11.193 | -43.516 | -35.168 | |
| (228.834) | (229.344) | (229.596) | ||
| First_NameP-T | 212.680 | 220.567 | 250.036 | |
| (223.380) | (225.648) | (225.928) | ||
| First_NameT-Z | -62.995 | -230.031 | -71.392 | |
| (372.294) | (372.302) | (373.578) | ||
| GenderGender Unknown | 5,798.240*** | 5,781.004*** | ||
| (585.245) | (585.309) | |||
| GenderMale | 7,315.277*** | 7,269.344*** | ||
| (629.987) | (630.034) | |||
| Last.Name.startB | 1,671.008*** | |||
| (503.257) | ||||
| Last.Name.startC | 1,647.468*** | |||
| (518.380) | ||||
| Last.Name.startD | 84.652 | |||
| (553.682) | ||||
| Last.Name.startE | 1,349.730* | |||
| (738.823) | ||||
| Last.Name.startF | 573.804 | |||
| (609.052) | ||||
| Last.Name.startG | -1,397.618** | |||
| (553.792) | ||||
| Last.Name.startH | -872.340 | |||
| (532.563) | ||||
| Last.Name.startI | 861.369 | |||
| (1,021.664) | ||||
| Last.Name.startJ | 1,126.054* | |||
| (674.941) | ||||
| Last.Name.startK | 230.385 | |||
| (552.170) | ||||
| Last.Name.startL | -163.566 | |||
| (537.964) | ||||
| Last.Name.startM | 187.433 | |||
| (491.053) | ||||
| Last.Name.startN | -105.547 | |||
| (690.373) | ||||
| Last.Name.startO | 2,136.139*** | |||
| (783.196) | ||||
| Last.Name.startP | 447.292 | |||
| (557.602) | ||||
| Last.Name.startQ | -3,299.892* | |||
| (1,689.085) | ||||
| Last.Name.startR | 1,712.001*** | |||
| (570.085) | ||||
| Last.Name.startS | 474.608 | |||
| (492.850) | ||||
| Last.Name.startT | 857.377 | |||
| (604.445) | ||||
| Last.Name.startU | -8,422.304*** | |||
| (1,815.731) | ||||
| Last.Name.startV | -534.440 | |||
| (739.546) | ||||
| Last.Name.startW | 438.395 | |||
| (544.148) | ||||
| Last.Name.startX | -10,776.280*** | |||
| (1,753.550) | ||||
| Last.Name.startY | 76.149 | |||
| (884.721) | ||||
| Last.Name.startZ | 1,052.583 | |||
| (880.917) | ||||
| Year Dummies? | Yes | Yes | Yes | Yes |
| Employer Dummies? | Yes | Yes | Yes | Yes |
| Observations | 94,051 | 94,051 | 94,051 | 94,051 |
| R2 | 0.412 | 0.412 | 0.413 | 0.414 |
| Adjusted R2 | 0.412 | 0.412 | 0.412 | 0.413 |
| Residual Std. Error | 24,493.210 | 24,493.540 | 24,475.970 | 24,460.630 |
| F Statistic | 700.818*** | 672.210*** | 661.081*** | 548.181*** |
| Notes: | ***Significant at the 1 percent level. | |||
| **Significant at the 5 percent level. | ||||
| *Significant at the 10 percent level. | ||||
Let’s look at the distribution of the last name initials.
plot(table(d$Last.Name.start),xlab="Last Name Initial", ylab="Number of Observations")
Sort the last name initials by the number of observations.
sort(table(d$Last.Name.start), decreasing = T)
| M | S | B | C | H | L | D | P | G | W | R | K | F | T | A | J | N | V | E | O | Y | Z | I | Q | U | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128684 | 106039 | 101749 | 89644 | 67218 | 65279 | 63820 | 59601 | 57196 | 55843 | 53884 | 49970 | 41377 | 41136 | 38861 | 24291 | 21905 | 19934 | 17830 | 16360 | 8945 | 8832 | 7397 | 2888 | 2309 | 702 |
Some last-name initials (e.g. “U” and “X”) are rare and may coincide with the employee’s race and cultural background. This could bias the estimates, because we cannot control for employees’ race (race data are not available in our dataset).
One option is to keep only the common last-name initials, which have more observations and are less likely to pick up a minority effect, and see whether our findings remain valid.
The most common last-name initials are:
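# The 14 most common last-name initials
set<-names(sort(table(d$Last.Name.start), decreasing = T))[1:14]
# The same initials in alphabetical order, minus the first one ("B"), which is the reference group
set.n.1<-sort(set,decreasing = F)[2:length(set)]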
set
## [1] "M" "S" "B" "C" "H" "L" "D" "P" "G" "W" "R" "K" "F" "T"
Here are the results. It appears that our findings remain valid. Your salary is related to the alphabetical order of your last name.
d.sub2<-subset(d.sub,Last.Name.start %in% set)
d.sub<-d.sub2
md <- lm(data = d.sub,
Salary.Paid ~ Last_Name+
Job.Title+
Employer+
Calendar.Year)
md.first.name <- lm(data = d.sub,
Salary.Paid ~ Last_Name+
Job.Title+
Employer+
Calendar.Year+
First_Name)
md.gender<- lm(data = d.sub,
Salary.Paid ~ Last_Name+
Job.Title+
Employer+
Calendar.Year+
First_Name+
Gender )
md.all.letters<- lm(data = d.sub,
Salary.Paid ~Job.Title+
Employer+
Calendar.Year+
First_Name+
Gender+
Last.Name.start)
md.all.letters.100k<- lm(data = d.sub,
(Salary.Paid-100000) ~Job.Title+
Employer+
Calendar.Year+
First_Name+
Gender+
Last.Name.start)
md.all.letters.log<- lm(data = d.sub,
log(Salary.Paid) ~Job.Title+
Employer+
Calendar.Year+
First_Name+
Gender+
Last.Name.start)
md.all.letters.log.100k<- lm(data = d.sub,
log(Salary.Paid-100000) ~Job.Title+
Employer+
Calendar.Year+
First_Name+
Gender+
Last.Name.start)
stargazer(md,
md.first.name,
md.gender,
md.all.letters,
md.all.letters.100k,
md.all.letters.log,
md.all.letters.log.100k,
omit = c("Calendar.Year","Employer"),
omit.labels = c("Year Dummies?","Employer Dummies?"),
df = FALSE,
type="html",
intercept.bottom = F,
style = "aer",covariate.labels = c())
| Salary.Paid | (Salary.Paid - 1e+05) | log(Salary.Paid) | log(Salary.Paid - 1e+05) | ||||
| (1) | (2) | (3) | (4) | (5) | (6) | (7) | |
| Constant | 100,290.400*** | 100,151.500*** | 93,735.960*** | 93,971.180*** | -6,028.819** | 11.543*** | 9.517*** |
| (2,543.269) | (2,546.127) | (2,621.864) | (2,628.167) | (2,628.167) | (0.015) | (0.057) | |
| Last_NameF-J | -2,000.986*** | -2,010.717*** | -2,018.406*** | ||||
| (275.956) | (276.000) | (275.851) | |||||
| Last_NameK-O | -1,184.096*** | -1,195.836*** | -1,180.312*** | ||||
| (249.172) | (249.248) | (249.104) | |||||
| Last_NameP-T | -492.425** | -501.543** | -516.948** | ||||
| (244.961) | (245.020) | (244.880) | |||||
| Last_NameT-Z | -828.129** | -841.560** | -862.727** | ||||
| (387.678) | (387.841) | (387.555) | |||||
| Job.TitleAssociate Professor | 12,203.100*** | 12,202.730*** | 12,199.920*** | 12,203.580*** | 12,203.580*** | 0.091*** | 0.455*** |
| (331.275) | (331.317) | (331.055) | (331.016) | (331.016) | (0.002) | (0.007) | |
| Job.TitleProfessor | 36,450.200*** | 36,448.150*** | 36,385.650*** | 36,379.050*** | 36,379.050*** | 0.241*** | 0.911*** |
| (341.119) | (341.320) | (341.096) | (341.089) | (341.089) | (0.002) | (0.007) | |
| First_NameF-J | 288.581 | 58.985 | 64.043 | 64.043 | -0.000 | -0.002 | |
| (252.828) | (253.455) | (253.523) | (253.523) | (0.001) | (0.006) | ||
| First_NameK-O | 245.800 | 176.575 | 173.145 | 173.145 | 0.001 | 0.003 | |
| (251.750) | (252.259) | (252.514) | (252.514) | (0.001) | (0.005) | ||
| First_NameP-T | 10.766 | 18.393 | 33.301 | 33.301 | 0.000 | 0.005 | |
| (244.029) | (246.469) | (246.635) | (246.635) | (0.001) | (0.005) | ||
| First_NameT-Z | 682.197* | 501.337 | 530.777 | 530.777 | 0.004* | 0.020** | |
| (409.707) | (409.703) | (410.345) | (410.345) | (0.002) | (0.009) | ||
| GenderGender Unknown | 6,174.504*** | 6,150.345*** | 6,150.345*** | 0.040*** | 0.131*** | ||
| (645.630) | (645.710) | (645.710) | (0.004) | (0.014) | |||
| GenderMale | 7,721.053*** | 7,703.478*** | 7,703.478*** | 0.050*** | 0.160*** | ||
| (692.797) | (692.937) | (692.937) | (0.004) | (0.015) | |||
| Last.Name.startC | -44.973 | -44.973 | 0.000 | 0.007 | |||
| (408.275) | (408.275) | (0.002) | (0.009) | ||||
| Last.Name.startD | -1,560.544*** | -1,560.544*** | -0.009*** | -0.028*** | |||
| (451.858) | (451.858) | (0.003) | (0.010) | ||||
| Last.Name.startF | -1,096.777** | -1,096.777** | -0.006** | -0.023** | |||
| (518.897) | (518.897) | (0.003) | (0.011) | ||||
| Last.Name.startG | -3,093.732*** | -3,093.732*** | -0.017*** | -0.043*** | |||
| (452.659) | (452.659) | (0.003) | (0.010) | ||||
| Last.Name.startH | -2,573.824*** | -2,573.824*** | -0.014*** | -0.038*** | |||
| (425.684) | (425.684) | (0.003) | (0.009) | ||||
| Last.Name.startK | -1,432.061*** | -1,432.061*** | -0.007*** | -0.011 | |||
| (450.390) | (450.390) | (0.003) | (0.010) | ||||
| Last.Name.startL | -1,856.081*** | -1,856.081*** | -0.009*** | -0.018* | |||
| (433.357) | (433.357) | (0.003) | (0.009) | ||||
| Last.Name.startM | -1,486.958*** | -1,486.958*** | -0.007*** | -0.016* | |||
| (372.459) | (372.459) | (0.002) | (0.008) | ||||
| Last.Name.startP | -1,218.943*** | -1,218.943*** | -0.007** | -0.017* | |||
| (457.532) | (457.532) | (0.003) | (0.010) | ||||
| Last.Name.startR | 24.086 | 24.086 | 0.001 | 0.008 | |||
| (472.281) | (472.281) | (0.003) | (0.010) | ||||
| Last.Name.startS | -1,200.992*** | -1,200.992*** | -0.005** | -0.005 | |||
| (374.686) | (374.686) | (0.002) | (0.008) | ||||
| Last.Name.startT | -829.469 | -829.469 | -0.004 | -0.026** | |||
| (514.040) | (514.040) | (0.003) | (0.011) | ||||
| Last.Name.startW | -1,256.888*** | -1,256.888*** | -0.006** | -0.014 | |||
| (440.781) | (440.781) | (0.003) | (0.010) | ||||
| Year Dummies? | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Employer Dummies? | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Observations | 78,538 | 78,538 | 78,538 | 78,538 | 78,538 | 78,538 | 78,538 |
| R2 | 0.413 | 0.413 | 0.414 | 0.414 | 0.414 | 0.485 | 0.568 |
| Adjusted R2 | 0.412 | 0.412 | 0.413 | 0.413 | 0.413 | 0.484 | 0.568 |
| Residual Std. Error | 24,518.980 | 24,518.950 | 24,499.580 | 24,495.420 | 24,495.420 | 0.144 | 0.533 |
| F Statistic | 587.100*** | 563.182*** | 554.051*** | 508.804*** | 508.804*** | 676.530*** | 946.893*** |
| Notes: | ***Significant at the 1 percent level. | ||||||
| **Significant at the 5 percent level. | |||||||
| *Significant at the 10 percent level. | |||||||
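Columns (4) and (5) have identical slopes and standard errors, and only the constant differs (by exactly 100,000). That is what we should expect: subtracting a constant from the dependent variable shifts the intercept and nothing else. A quick check on simulated data (invented numbers, base R only):

```r
set.seed(7)

# Simulated salaries that depend linearly on one regressor
toy <- data.frame(x = rnorm(100))
toy$salary <- 120000 + 3000 * toy$x + rnorm(100, sd = 2000)

fit_raw     <- lm(salary ~ x, data = toy)
fit_shifted <- lm(I(salary - 100000) ~ x, data = toy)

# Slopes agree; intercepts differ by exactly the subtracted constant
print(coef(fit_raw))
print(coef(fit_shifted))
```

So the shifted level model in column (5) carries no new information. The subtraction only matters once a log is taken, as in column (7).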
Here are the residual plots.
Residuals for the model where salary is unchanged.
par(mfrow = c(2,2))
plot(md.all.letters)
Residuals for the model where $100,000 is subtracted from salary. This is because our dataset only includes salaries higher than $100,000 per year.
par(mfrow = c(2,2))
plot(md.all.letters.100k)
Residuals for the model where salary is in the log form.
par(mfrow = c(2,2))
plot(md.all.letters.log)
Residuals for the model where $100,000 is subtracted from salary, and then in log form.
par(mfrow = c(2,2))
plot(md.all.letters.log.100k)
Let’s see whether the effect, as hypothesized by Professor Marc Bellemare, would be generally monotonic.
d.tmp<-data.frame(Last.Name.Initial=set.n.1,
Coefficient=coef(md.all.letters)[98:(length(set.n.1)+98-1)],
p.value=summary(md.all.letters)$coefficients[98:(length(set.n.1)+98-1),4])
d.tmp$significant10<-"Not Significant"
d.tmp$significant10[d.tmp$p.value<=0.1]<-"Significant at 10%"
ggplot(data = d.tmp,aes(Last.Name.Initial,Coefficient))+
geom_col(aes(fill =significant10))+
ylab("Salary Drop Compared to the Reference Last Name Initial ($)")+
xlab("Last Name Initial")
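The code above picks out the last-name coefficients by position (starting at row 98), which silently breaks if the number of employer or year dummies changes. A more defensive alternative is to select the rows by name. The sketch below shows the idea on simulated data; the variable names `initial` and `rank` are invented for the example, and in the real model the pattern would be `"^Last.Name.start"`.

```r
set.seed(3)

# Simulated data with a made-up initial and rank for each person
toy <- data.frame(
  initial = factor(sample(c("B", "C", "D"), 300, replace = TRUE)),
  rank    = factor(sample(c("Assistant", "Full"), 300, replace = TRUE))
)
toy$salary <- 100000 + 20000 * (toy$rank == "Full") + rnorm(300, sd = 5000)

fit <- lm(salary ~ rank + initial, data = toy)
tab <- summary(fit)$coefficients

# Select the rows whose names start with the factor's name, wherever they sit
rows <- grep("^initial", rownames(tab))
res <- data.frame(
  initial  = sub("^initial", "", rownames(tab)[rows]),
  estimate = tab[rows, "Estimate"],
  p.value  = tab[rows, "Pr(>|t|)"],
  stringsAsFactors = FALSE
)
print(res)
```

The reference level ("B" here) has no row of its own, which is why the plot shows one bar fewer than the number of initials.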
Compared to employees whose last names start with B, employees whose names start with C do not experience a significant salary drop. For Ds and Fs, however, the salary drop is $1,560 and $1,096 per year. The adverse effect is strongest for professor Gs, whose salary is $3,093 per year lower than that of Bs. After that, the adverse effect shrinks to roughly $2,500, $2,000, and then $1,500 as the initial moves further away from B. Eventually, the adverse last-name effect settles at about $1,200 per year. So we could conclude that the adverse last-name effect increases first and then decreases. This makes sense. Think of a case where we need to read something quickly: we would look at the first few pages, skip the middle, and then skim the last few pages. Thus, the adverse effect would be largest if the last name is in the middle.
How about the last model, where we subtract $100,000 from the salary variable and then take the log? The R-squared increases to 57%, which suggests a better fit, although R-squared values are not strictly comparable across models with different dependent variables. Let’s plot the estimated coefficients of the impact of last name initial on salary. It shows a similar pattern.
d.tmp<-data.frame(Last.Name.Initial=set.n.1,
Coefficient=c(coef(md.all.letters.log.100k)[98:(length(set.n.1)+98-1)]),
p.value=summary(md.all.letters.log.100k)$coefficients[98:(length(set.n.1)+98-1),4])
d.tmp$significant10<-"Not Significant"
d.tmp$significant10[d.tmp$p.value<=0.1]<-"Significant at 10%"
ggplot(data = d.tmp,aes(Last.Name.Initial,Coefficient))+
geom_col(aes(fill=significant10))+
ylab("Change in log(salary - 100k) Compared to the Reference Last Name Initial")+
xlab("Last Name Initial")
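The coefficients in the log models are in log points rather than dollars. To read one as a percentage change, use 100 * (exp(b) - 1). As an example, take the coefficient for initial G in column (7), -0.043. Keep in mind that the dependent variable there is log(salary - 100,000), so the percentage refers to the part of the salary above $100,000, not to the whole salary.

```r
# Coefficient for Last.Name.startG in column (7) of the table above
b <- -0.043

# Exact percentage change implied by a log-point coefficient
pct_change <- 100 * (exp(b) - 1)
print(round(pct_change, 1))

# For small coefficients, 100 * b is a close approximation
print(100 * b)
```

So a professor G earns about 4.2% less than a professor B on the portion of salary above $100,000, holding the other variables fixed.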
Well, my father may be mad. But next time you see me, Max Shang is no longer my name. I’m Max Aaaa.