1 Question

Is your name related to how much you get paid? How much more is person AAA paid compared to ZZZ?

2 Hypothesis

Salary is NOT correlated with the alphabetical order of one’s last name. That is, Professor ABC is paid about the same as Professor XYZ.
Male and female workers are paid the same for the same job.

3 Key findings

Salary is significantly related to the alphabetical order of one’s last name: earlier initials are paid more. Compared with employees whose last names start with A to E, those with later initials are paid roughly $900 to $1,500 less per year (the P-T group is the exception, with a smaller and statistically insignificant gap).
Female professors are paid about $7,300 less per year than males, controlling for job title, employer, year, and the alphabetical order of the last name.
Compared with Assistant Professors, the salary premium for Associate Professors and Professors is about $12,500 and $36,500 per year, respectively.

More details and my R code follow.

4 Data source

Ontario public sector salary disclosure. The individual-level data include the following variables:

library(tidyverse)  # read_rds(), dplyr verbs, ggplot2
library(stargazer)  # regression tables
d<-read_rds("pssd_1996_2017.rds")
d$Calendar.Year<-as.factor(d$Calendar.Year)
names(d)

## [1] "Sector"           "Last.Name"        "First.Name"      
## [4] "Salary.Paid"      "Taxable.Benefits" "Employer"        
## [7] "Job.Title"        "Calendar.Year"

4.1 Data Cleaning Process

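The data for each year were downloaded and combined. I also generated new variables: Last.Name.start holds the surname initial, and Last_Name groups that initial into A-E, F-J, K-O, P-T, or U-Z (First.Name.start and First_Name do the same for first names). Because T is assigned to P-T, the last group, labelled “T-Z” in the output, effectively covers U to Z.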

# Bin first- and last-name initials into five alphabetical groups.
# T belongs to "P-T", so the final group effectively covers U-Z;
# its "T-Z" label is kept to match the regression output below.
d<-d %>% 
  mutate(First.Name.start=substr(First.Name,1,1),
         Last.Name.start=substr(Last.Name,1,1)) %>% 
  filter(First.Name.start %in% LETTERS &
           Last.Name.start %in% LETTERS)     %>% 
  mutate(First_Name=case_when(First.Name.start %in% LETTERS[1:5] ~ "A-E",
                           First.Name.start %in% LETTERS[6:10] ~ "F-J",
                           First.Name.start %in% LETTERS[11:15] ~ "K-O",
                           First.Name.start %in% LETTERS[16:20] ~ "P-T",
                           First.Name.start %in% LETTERS[21:26] ~ "T-Z"
                           ))                   %>% 
  mutate(Last_Name=case_when( Last.Name.start %in% LETTERS[1:5] ~ "A-E",
                           Last.Name.start %in% LETTERS[6:10] ~ "F-J",
                           Last.Name.start %in% LETTERS[11:15] ~ "K-O",
                           Last.Name.start %in% LETTERS[16:20] ~ "P-T",
                           Last.Name.start %in% LETTERS[21:26] ~ "T-Z"
  ))
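To check which group a given initial ends up in, the binning can be reproduced on a handful of letters. The sketch below is an addition, not part of the original analysis: it uses base R’s cut() rather than case_when(), and the helper name bin_initial is made up for illustration. In the grouping above, T is assigned to P-T, so the last group effectively covers U to Z, and the sketch labels it that way.

```r
# Hypothetical helper: map a capital initial to its five-letter group.
bin_initial <- function(initial) {
  pos <- match(initial, LETTERS)          # A = 1, ..., Z = 26
  cut(pos,
      breaks = c(0, 5, 10, 15, 20, 26),   # intervals (0,5], (5,10], ..., (20,26]
      labels = c("A-E", "F-J", "K-O", "P-T", "U-Z"))
}

groups <- as.character(bin_initial(c("A", "E", "F", "T", "U", "Z")))
print(groups)
```

Here E still falls in the first group, T in the fourth, and U in the fifth.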

5 Preliminary Results

This preliminary analysis is restricted to individuals whose job title is Assistant Professor, Associate Professor, or Professor.

Here are the regression results.

Last names starting with A-E are the reference group. Compared with this group, people whose last names come later in the alphabet are paid less, although the gap for the P-T group is not statistically significant.

Column (2) adds the initial of the first name as an explanatory variable. It is not significant, but the initial of the last name remains significant.

d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year
)

md.add.first.name <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         First_Name+
         Job.Title+
         Employer+
         Calendar.Year)
         
stargazer(md,md.add.first.name,
          omit = "Employer",
          omit.labels = "Employer",
          intercept.bottom = F,
          df = FALSE,
          type="html")


	Dependent variable:

	Salary.Paid
	(1)	(2)

Constant	58,902.860^***	58,818.970^***
	(3,413.797)	(3,416.030)

Last_NameF-J	-1,386.548^***	-1,385.900^***
	(227.773)	(227.823)

Last_NameK-O	-828.753^***	-829.437^***
	(207.007)	(207.074)

Last_NameP-T	-322.248	-320.559
	(211.311)	(211.362)

Last_NameT-Z	-1,108.991^***	-1,106.210^***
	(276.599)	(276.812)

First_NameF-J		120.286
		(211.740)

First_NameK-O		28.464
		(209.901)

First_NameP-T		206.241
		(204.896)

First_NameT-Z		-16.824
		(341.493)

Job.TitleAssociate Professor	11,846.280^***	11,844.550^***
	(276.308)	(276.334)

Job.TitleProfessor	34,952.760^***	34,947.670^***
	(286.153)	(286.279)

Calendar.Year1998	4,219.026	4,233.896
	(3,049.475)	(3,049.597)

Calendar.Year1999	3,905.832	3,919.749
	(2,997.917)	(2,998.025)

Calendar.Year2000	4,869.895^*	4,882.420^*
	(2,914.033)	(2,914.176)

Calendar.Year2001	6,515.093^**	6,524.331^**
	(2,911.235)	(2,911.369)

Calendar.Year2002	7,750.415^***	7,759.718^***
	(2,882.859)	(2,882.988)

Calendar.Year2003	12,559.320^***	12,568.870^***
	(2,848.803)	(2,848.949)

Calendar.Year2004	7,839.397^***	7,852.529^***
	(2,847.962)	(2,848.176)

Calendar.Year2005	19,297.630^***	19,311.460^***
	(2,826.526)	(2,826.723)

Calendar.Year2006	21,915.240^***	21,929.960^***
	(2,826.652)	(2,826.863)

Calendar.Year2007	28,603.920^***	28,617.230^***
	(2,819.952)	(2,820.166)

Calendar.Year2008	32,577.590^***	32,590.420^***
	(2,816.833)	(2,817.037)

Calendar.Year2009	37,890.150^***	37,904.000^***
	(2,812.323)	(2,812.534)

Calendar.Year2010	41,064.560^***	41,077.630^***
	(2,810.306)	(2,810.519)

Calendar.Year2011	43,725.490^***	43,739.210^***
	(2,807.960)	(2,808.177)

Calendar.Year2012	46,419.780^***	46,433.590^***
	(2,807.480)	(2,807.701)

Calendar.Year2013	48,718.290^***	48,732.480^***
	(2,808.908)	(2,809.139)

Calendar.Year2014	51,732.380^***	51,746.830^***
	(2,807.829)	(2,808.066)

Calendar.Year2015	53,331.590^***	53,346.510^***
	(2,805.817)	(2,806.055)

Calendar.Year2016	56,443.620^***	56,458.290^***
	(2,805.224)	(2,805.464)

Calendar.Year2017	59,664.220^***	59,679.830^***
	(2,808.626)	(2,808.864)


Employer	Yes	Yes

Observations	94,051	94,051
R²	0.463	0.463
Adjusted R²	0.462	0.462
Residual Std. Error	22,465.490	22,465.810
F Statistic	715.326^***	690.862^***

Note:	*p<0.1; **p<0.05; ***p<0.01
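To make the reference-group reading concrete, here is a toy example on made-up numbers (not the salary data). With a factor regressor, lm() drops the first level and reports every other level relative to it, so in a model with no other controls each coefficient is simply the gap between that group’s mean and the reference group’s mean.

```r
# Made-up salaries, for illustration only.
toy <- data.frame(
  Last_Name   = factor(rep(c("A-E", "F-J", "K-O"), each = 2)),
  Salary.Paid = c(100, 104, 97, 101, 99, 103)
)

fit <- lm(Salary.Paid ~ Last_Name, data = toy)
group_means <- tapply(toy$Salary.Paid, toy$Last_Name, mean)

# Intercept = mean of the reference group (A-E);
# Last_NameF-J = mean(F-J) - mean(A-E), and so on.
print(coef(fit))
print(group_means - group_means["A-E"])
```

In the real models the same logic applies, except that the gaps are measured after holding job title, employer, and year fixed.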

6 Test to see violation of model assumptions

Errors should follow a normal distribution

The normal Q-Q plot does not look linear, so the errors do not appear to be normally distributed.

par(mfrow = c(2,2))
plot(md)

No autocorrelation

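The D-W statistic (1.89) is close to 2, so any first-order autocorrelation in the residuals is small. Strictly speaking, though, the test rejects the null hypothesis of no positive autocorrelation (p < 2.2e-16), which is not surprising with 94,051 observations, so this assumption is only approximately satisfied.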

library(lmtest)  # provides dwtest()
dwtest(md)

## 
##  Durbin-Watson test
## 
## data:  md
## DW = 1.8915, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
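For readers who want to see what the statistic measures, the Durbin-Watson value can be computed directly from the residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below does this on simulated data (an assumption for illustration, not the salary data). The statistic is roughly 2(1 - r), where r is the lag-1 autocorrelation of the residuals, so a value of 1.89 corresponds to r of about 0.05. Note that the result depends on the order of the rows.

```r
set.seed(1)                      # simulated data, for illustration only
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)      # independent errors by construction
fit <- lm(y ~ x)

e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)               # Durbin-Watson statistic
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)   # lag-1 autocorrelation of residuals

print(c(DW = dw, approx = 2 * (1 - r1)))
```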

7 Why so?

Here is some related research on the last-name premium:

van Praag, M., & van Praag, B. M. (2004). First-Author Determinants and the Benefits of Being Professor A (and not Z).

“We find that author groups clearly account for the relative individual costs and benefits of being a first author when positioning authors of a group: Economists perceive name ordering as a deliberate decision. The usefulness of this deliberation is supported by the answer to our second question: is scientific output higher for authors whose names rank first in the alphabet? Career prospects are indeed better for academic economists who have high chances to be a first author. This productivity effect does not become visible at the beginning of one’s career: it turns out to be a reputation or visibility effect and it therefore only shows up when an economist’s career is more advanced.”

Efthyvoulou, G. (2008). Alphabet economics: The link between names and reputation. The Journal of Socio-Economics, 37(3), 1266-1285.

“Overall, we find that faculty members with earlier last name initials are more likely to get employment at high standard research departments. Furthermore, we show that the relationship between alphabetical placement and academic success remains significant if we use as an alternative measure of reputation the number of people showing an interest in the papers of a particular academic.”

8 Adjust salary by CPI

The previous results use nominal salaries. Here I adjust salaries for inflation using the CPI to see whether that affects the main findings. All salaries are converted to 2017 dollars.

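cpi<-read.csv("CanadaCPI.csv")
cpi$year<-as.numeric(substr(cpi$Quarter,1,4))
# Average the quarterly inflation rates (in percent) within each year
d.cpi<-data.frame(year=1993:2017,CPI=c(by(cpi$CPI.Inflation,cpi$year,mean)))
d.cpi$pct<-(d.cpi$CPI/100+1)

# Cumulative factor that converts year-i dollars to 2017 dollars.
# Note: i:2017 includes year i's own inflation; if CPI.Inflation measures the
# change from the previous year, (i+1):2017 would be the matching range.
d.cpi$to2017<-1
for (i in 1993:2016) {
d.cpi$to2017[d.cpi$year==i] <- prod(d.cpi$pct[d.cpi$year %in% i:2017])
}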

d.merge<-merge(x = d,y = d.cpi,by.x ="Calendar.Year", by.y = "year",all.x = T )

d<-mutate(d.merge,
          Salary.Paid=Salary.Paid*to2017,
          Taxable.Benefits=Taxable.Benefits*to2017
          )
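The compounding step is easy to get off by one year, so a small worked example may help. The inflation rates below are made up, and the sketch assumes each year’s rate is the percentage change from the previous year (check how CanadaCPI.csv defines CPI.Inflation before relying on this). Under that assumption, converting year-i dollars into 2017 dollars compounds the rates of years i+1 through 2017; starting the product at year i, as the loop above does, adds one extra year of inflation.

```r
# Hypothetical annual inflation rates in percent (not real CPI data).
toy <- data.frame(year = 2014:2017, CPI = c(2, 1, 2, 3))
toy$pct <- toy$CPI / 100 + 1

# Factor that converts year-i dollars to 2017 dollars,
# assuming CPI is the change from the previous year.
toy$to2017 <- sapply(toy$year, function(i) prod(toy$pct[toy$year > i]))

# Same factor with the product starting at year i itself.
toy$from_i <- sapply(toy$year, function(i) prod(toy$pct[toy$year >= i]))

print(toy)
```

In the toy table, a 2016 salary is scaled by 1.03 under the first rule and by 1.02 × 1.03 under the second.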

Here are the results with inflation-adjusted salaries.

d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year
)

md.add.first.name <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         First_Name+
         Job.Title+
         Employer+
         Calendar.Year)
         
stargazer(md,md.add.first.name,
          omit = "Employer",
          omit.labels = "Employer",
          intercept.bottom = F,
          df = FALSE,
          type="html")


	Dependent variable:

	Salary.Paid
	(1)	(2)

Constant	108,605.600^***	108,536.000^***
	(3,703.862)	(3,706.282)

Last_NameF-J	-1,492.545^***	-1,491.028^***
	(247.126)	(247.180)

Last_NameK-O	-886.264^***	-885.580^***
	(224.596)	(224.668)

Last_NameP-T	-342.091	-339.603
	(229.265)	(229.321)

Last_NameT-Z	-1,223.407^***	-1,218.265^***
	(300.101)	(300.332)

First_NameF-J		124.782
		(229.731)

First_NameK-O		-19.573
		(227.736)

First_NameP-T		190.556
		(222.306)

First_NameT-Z		-107.331
		(370.509)

Job.TitleAssociate Professor	12,431.790^***	12,429.340^***
	(299.786)	(299.813)

Job.TitleProfessor	37,403.640^***	37,397.000^***
	(310.467)	(310.603)

Calendar.Year1998	50.616	67.260
	(3,308.584)	(3,308.714)

Calendar.Year1999	-1,821.808	-1,806.956
	(3,252.645)	(3,252.761)

Calendar.Year2000	-3,309.803	-3,296.403
	(3,161.633)	(3,161.788)

Calendar.Year2001	-5,453.563^*	-5,442.926^*
	(3,158.598)	(3,158.742)

Calendar.Year2002	-7,664.755^**	-7,654.777^**
	(3,127.811)	(3,127.949)

Calendar.Year2003	-5,211.303^*	-5,200.639^*
	(3,090.861)	(3,091.018)

Calendar.Year2004	-13,498.750^***	-13,484.050^***
	(3,089.949)	(3,090.179)

Calendar.Year2005	-3,682.918	-3,667.822
	(3,066.692)	(3,066.904)

Calendar.Year2006	-3,756.728	-3,740.102
	(3,066.829)	(3,067.055)

Calendar.Year2007	959.974	975.481
	(3,059.558)	(3,059.789)

Calendar.Year2008	2,277.027	2,291.945
	(3,056.175)	(3,056.395)

Calendar.Year2009	4,521.858	4,538.341
	(3,051.282)	(3,051.509)

Calendar.Year2010	7,640.857^**	7,656.226^**
	(3,049.093)	(3,049.322)

Calendar.Year2011	7,848.134^***	7,864.311^***
	(3,046.548)	(3,046.781)

Calendar.Year2012	6,795.538^**	6,811.963^**
	(3,046.027)	(3,046.265)

Calendar.Year2013	7,100.251^**	7,117.161^**
	(3,047.576)	(3,047.825)

Calendar.Year2014	9,068.066^***	9,085.208^***
	(3,046.406)	(3,046.661)

Calendar.Year2015	8,043.251^***	8,061.040^***
	(3,044.223)	(3,044.479)

Calendar.Year2016	9,684.371^***	9,701.781^***
	(3,043.579)	(3,043.839)

Calendar.Year2017	8,131.262^***	8,149.733^***
	(3,047.270)	(3,047.527)


Employer	Yes	Yes

Observations	94,051	94,051
R²	0.418	0.418
Adjusted R²	0.417	0.417
Residual Std. Error	24,374.350	24,374.680
F Statistic	596.978^***	576.565^***

Note:	*p<0.1; **p<0.05; ***p<0.01

9 Male Names vs. Female Names

It would be interesting to see how gender is related to salary. The Ontario public sector salary data do not record gender, but they do include names. So I matched the first names against the top 100 male and female names in Canada according to this website. For names not on either top-100 list, the gender variable is set to “Gender Unknown”. I hope there are not many males with female names or vice versa.

top100names<-read.csv("top100names.csv",colClasses = c("integer","character","character"))
top100names<-top100names %>% 
mutate(Boy = tolower(Boy)) %>% 
  mutate(Girl = tolower(Girl)) 
set.top.male.name<-c(top100names$Boy)  
set.top.female.name<-c(top100names$Girl)  




d <-d %>% 
  mutate(First.Name = tolower(First.Name)) %>% 
  mutate(Last.Name = tolower(Last.Name)) %>% 
  mutate(top_100_male_name = case_when(First.Name %in%  set.top.male.name~ "YES",
                                       TRUE ~ "NO")) %>% 
  mutate(top_100_female_name = case_when(First.Name %in% set.top.female.name ~ "YES",
                                       TRUE ~ "NO")) %>% 
  mutate(Gender = case_when(top_100_male_name == "YES" ~ "Male",
                          top_100_female_name == "YES" ~ "Female",
                                       TRUE ~ "Gender Unknown"))
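One thing worth checking with this kind of matching is whether any name appears on both lists, because the case_when() above tests the male list first and would label such a name “Male”. A minimal sketch with made-up name lists (the real lists come from top100names.csv):

```r
# Hypothetical stand-ins for set.top.male.name and set.top.female.name.
male_names   <- c("james", "taylor", "liam")
female_names <- c("emma", "taylor", "olivia")

first_names <- c("emma", "taylor", "liam", "quinn")

# Names on both lists are ambiguous.
ambiguous <- intersect(male_names, female_names)

# Same precedence as the case_when() above: male list first, then female.
gender <- ifelse(first_names %in% male_names, "Male",
          ifelse(first_names %in% female_names, "Female", "Gender Unknown"))

print(ambiguous)
print(table(gender))
```

If the intersection on the real lists is non-empty, those names could be moved to “Gender Unknown” instead.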

9.1 Plots

Let’s plot the data.

d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
ggplot(d.sub,aes(Job.Title,Salary.Paid))+
  geom_boxplot()

ggplot(d.sub,aes(Gender,Salary.Paid))+
  geom_boxplot()

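# Calendar.Year is a factor, so go through as.character() to recover the actual years
d.sub$Calendar.Year<-as.numeric(as.character(d.sub$Calendar.Year))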
ggplot(d.sub,aes(Calendar.Year,Salary.Paid,color=Gender))+
  geom_point(alpha=0.05)+
  stat_smooth(size=1)+
  facet_grid(~Job.Title)+
  theme(legend.position="bottom")

9.2 Results

Here are the regression results. Column (3) adds gender, with female as the reference group. The results suggest that males are paid $7,315 more per year than females after controlling for the alphabetical order of last names, job title, employer, and year. Of course, these results are preliminary, and the model explains only 41% of the variation in salary. Are women doing the same job as men but being paid less? If so, then from an employer’s point of view it would pay to hire more women, which would raise the demand for female workers. As employers compete for female workers by offering them higher wages, female salaries should be pushed up. So what is causing this $7,315 annual salary gap between men and women?


	Salary.Paid
	(1)	(2)	(3)

Constant	98,243.520^***	98,164.910^***	92,131.070^***
	(2,168.682)	(2,171.705)	(2,245.461)

Last_NameF-J	-1,498.886^***	-1,498.243^***	-1,511.285^***
	(248.313)	(248.367)	(248.232)

Last_NameK-O	-904.142^***	-904.453^***	-897.833^***
	(225.684)	(225.757)	(225.616)

Last_NameP-T	-358.774	-356.869	-379.615^*
	(230.371)	(230.426)	(230.289)

Last_NameT-Z	-1,229.256^***	-1,225.713^***	-1,248.908^***
	(301.554)	(301.786)	(301.622)

Job.TitleAssociate Professor	12,415.990^***	12,413.560^***	12,413.890^***
	(301.182)	(301.209)	(300.993)

Job.TitleProfessor	36,547.560^***	36,540.660^***	36,489.230^***
	(310.130)	(310.272)	(310.084)

First_NameF-J		162.720	-50.453
		(230.824)	(231.382)

First_NameK-O		11.193	-43.516
		(228.834)	(229.344)

First_NameP-T		212.680	220.567
		(223.380)	(225.648)

First_NameT-Z		-62.995	-230.031
		(372.294)	(372.302)

GenderGender Unknown			5,798.240^***
			(585.245)

GenderMale			7,315.277^***
			(629.987)

Year Dummies?	Yes	Yes	Yes
Employer Dummies?	Yes	Yes	Yes
Observations	94,051	94,051	94,051
R²	0.412	0.412	0.413
Adjusted R²	0.412	0.412	0.412
Residual Std. Error	24,493.210	24,493.540	24,475.970
F Statistic	700.818^***	672.210^***	661.081^***

Notes:	^***Significant at the 1 percent level.
	^**Significant at the 5 percent level.
	^*Significant at the 10 percent level.

10 A dummy for each last name initial

One of my favorite economists, Professor Marc F. Bellemare, suggests:

“one dummy for each letter and see if it’s generally monotonic.”

Great suggestion! Let’s generate one dummy for each last name initial and repeat our main model. The results are reported in column (4).


	Salary.Paid
	(1)	(2)	(3)	(4)

Constant	98,243.520^***	98,164.910^***	92,131.070^***	90,970.680^***
	(2,168.682)	(2,171.705)	(2,245.461)	(2,279.831)

Last_NameF-J	-1,498.886^***	-1,498.243^***	-1,511.285^***
	(248.313)	(248.367)	(248.232)

Last_NameK-O	-904.142^***	-904.453^***	-897.833^***
	(225.684)	(225.757)	(225.616)

Last_NameP-T	-358.774	-356.869	-379.615^*
	(230.371)	(230.426)	(230.289)

Last_NameT-Z	-1,229.256^***	-1,225.713^***	-1,248.908^***
	(301.554)	(301.786)	(301.622)

Job.TitleAssociate Professor	12,415.990^***	12,413.560^***	12,413.890^***	12,413.980^***
	(301.182)	(301.209)	(300.993)	(300.847)

Job.TitleProfessor	36,547.560^***	36,540.660^***	36,489.230^***	36,495.860^***
	(310.130)	(310.272)	(310.084)	(309.965)

First_NameF-J		162.720	-50.453	-20.195
		(230.824)	(231.382)	(231.421)

First_NameK-O		11.193	-43.516	-35.168
		(228.834)	(229.344)	(229.596)

First_NameP-T		212.680	220.567	250.036
		(223.380)	(225.648)	(225.928)

First_NameT-Z		-62.995	-230.031	-71.392
		(372.294)	(372.302)	(373.578)

GenderGender Unknown			5,798.240^***	5,781.004^***
			(585.245)	(585.309)

GenderMale			7,315.277^***	7,269.344^***
			(629.987)	(630.034)

Last.Name.startB				1,671.008^***
				(503.257)

Last.Name.startC				1,647.468^***
				(518.380)

Last.Name.startD				84.652
				(553.682)

Last.Name.startE				1,349.730^*
				(738.823)

Last.Name.startF				573.804
				(609.052)

Last.Name.startG				-1,397.618^**
				(553.792)

Last.Name.startH				-872.340
				(532.563)

Last.Name.startI				861.369
				(1,021.664)

Last.Name.startJ				1,126.054^*
				(674.941)

Last.Name.startK				230.385
				(552.170)

Last.Name.startL				-163.566
				(537.964)

Last.Name.startM				187.433
				(491.053)

Last.Name.startN				-105.547
				(690.373)

Last.Name.startO				2,136.139^***
				(783.196)

Last.Name.startP				447.292
				(557.602)

Last.Name.startQ				-3,299.892^*
				(1,689.085)

Last.Name.startR				1,712.001^***
				(570.085)

Last.Name.startS				474.608
				(492.850)

Last.Name.startT				857.377
				(604.445)

Last.Name.startU				-8,422.304^***
				(1,815.731)

Last.Name.startV				-534.440
				(739.546)

Last.Name.startW				438.395
				(544.148)

Last.Name.startX				-10,776.280^***
				(1,753.550)

Last.Name.startY				76.149
				(884.721)

Last.Name.startZ				1,052.583
				(880.917)

Year Dummies?	Yes	Yes	Yes	Yes
Employer Dummies?	Yes	Yes	Yes	Yes
Observations	94,051	94,051	94,051	94,051
R²	0.412	0.412	0.413	0.414
Adjusted R²	0.412	0.412	0.412	0.413
Residual Std. Error	24,493.210	24,493.540	24,475.970	24,460.630
F Statistic	700.818^***	672.210^***	661.081^***	548.181^***

Notes:	^***Significant at the 1 percent level.
	^**Significant at the 5 percent level.
	^*Significant at the 10 percent level.

10.1 Rare last names ranked lower in alphabetical order may coincide with race

Let’s look at the distribution of the last name initials.

plot(table(d$Last.Name.start),xlab="Last Name Initial", ylab="Number of Observations")

Sort the last name initials by the number of observations.

sort(table(d$Last.Name.start), decreasing = T)

M	S	B	C	H	L	D	P	G	W	R	K	F	T	A	J	N	V	E	O	Y	Z	I	Q	U	X
128684	106039	101749	89644	67218	65279	63820	59601	57196	55843	53884	49970	41377	41136	38861	24291	21905	19934	17830	16360	8945	8832	7397	2888	2309	702

Some last-name initials (e.g. “U” and “X”) are rare and may coincide with the employee’s race and cultural background. This would bias the model, because we cannot control for employees’ race (race is not available in our dataset).

As a check, we can keep only the most common last-name initials (which are less likely to pick up a minority group) and see whether our findings still hold.

10.2 Last name with more observations

Keep the 14 most common last-name initials. They are:

set<-names(sort(table(d$Last.Name.start), decreasing = T))[1:14]

set.n.1<-sort(set,decreasing = F)[2:length(set)]
set

##  [1] "M" "S" "B" "C" "H" "L" "D" "P" "G" "W" "R" "K" "F" "T"

Here are the results. It appears that our findings remain valid. Your salary is related to the alphabetical order of your last name.

d.sub2<-subset(d.sub,Last.Name.start %in% set)

d.sub<-d.sub2
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year)

md.first.name <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year+
         First_Name)

md.gender<- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
          Gender )

md.all.letters<- lm(data = d.sub,
         Salary.Paid ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)

md.all.letters.100k<- lm(data = d.sub,
         (Salary.Paid-100000) ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)

md.all.letters.log<- lm(data = d.sub,
         log(Salary.Paid) ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)

md.all.letters.log.100k<- lm(data = d.sub,
         log(Salary.Paid-100000) ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)


stargazer(md,
          md.first.name,
          md.gender,
          md.all.letters,
          md.all.letters.100k,
          md.all.letters.log,
          md.all.letters.log.100k,
          omit = c("Calendar.Year","Employer"),
          omit.labels = c("Year Dummies?","Employer Dummies?"),
          df = FALSE,
          type="html",
          intercept.bottom = F,
          style = "aer",covariate.labels = c())


	Salary.Paid				(Salary.Paid - 1e+05)	log(Salary.Paid)	log(Salary.Paid - 1e+05)
	(1)	(2)	(3)	(4)	(5)	(6)	(7)

Constant	100,290.400^***	100,151.500^***	93,735.960^***	93,971.180^***	-6,028.819^**	11.543^***	9.517^***
	(2,543.269)	(2,546.127)	(2,621.864)	(2,628.167)	(2,628.167)	(0.015)	(0.057)

Last_NameF-J	-2,000.986^***	-2,010.717^***	-2,018.406^***
	(275.956)	(276.000)	(275.851)

Last_NameK-O	-1,184.096^***	-1,195.836^***	-1,180.312^***
	(249.172)	(249.248)	(249.104)

Last_NameP-T	-492.425^**	-501.543^**	-516.948^**
	(244.961)	(245.020)	(244.880)

Last_NameT-Z	-828.129^**	-841.560^**	-862.727^**
	(387.678)	(387.841)	(387.555)

Job.TitleAssociate Professor	12,203.100^***	12,202.730^***	12,199.920^***	12,203.580^***	12,203.580^***	0.091^***	0.455^***
	(331.275)	(331.317)	(331.055)	(331.016)	(331.016)	(0.002)	(0.007)

Job.TitleProfessor	36,450.200^***	36,448.150^***	36,385.650^***	36,379.050^***	36,379.050^***	0.241^***	0.911^***
	(341.119)	(341.320)	(341.096)	(341.089)	(341.089)	(0.002)	(0.007)

First_NameF-J		288.581	58.985	64.043	64.043	-0.000	-0.002
		(252.828)	(253.455)	(253.523)	(253.523)	(0.001)	(0.006)

First_NameK-O		245.800	176.575	173.145	173.145	0.001	0.003
		(251.750)	(252.259)	(252.514)	(252.514)	(0.001)	(0.005)

First_NameP-T		10.766	18.393	33.301	33.301	0.000	0.005
		(244.029)	(246.469)	(246.635)	(246.635)	(0.001)	(0.005)

First_NameT-Z		682.197^*	501.337	530.777	530.777	0.004^*	0.020^**
		(409.707)	(409.703)	(410.345)	(410.345)	(0.002)	(0.009)

GenderGender Unknown			6,174.504^***	6,150.345^***	6,150.345^***	0.040^***	0.131^***
			(645.630)	(645.710)	(645.710)	(0.004)	(0.014)

GenderMale			7,721.053^***	7,703.478^***	7,703.478^***	0.050^***	0.160^***
			(692.797)	(692.937)	(692.937)	(0.004)	(0.015)

Last.Name.startC				-44.973	-44.973	0.000	0.007
				(408.275)	(408.275)	(0.002)	(0.009)

Last.Name.startD				-1,560.544^***	-1,560.544^***	-0.009^***	-0.028^***
				(451.858)	(451.858)	(0.003)	(0.010)

Last.Name.startF				-1,096.777^**	-1,096.777^**	-0.006^**	-0.023^**
				(518.897)	(518.897)	(0.003)	(0.011)

Last.Name.startG				-3,093.732^***	-3,093.732^***	-0.017^***	-0.043^***
				(452.659)	(452.659)	(0.003)	(0.010)

Last.Name.startH				-2,573.824^***	-2,573.824^***	-0.014^***	-0.038^***
				(425.684)	(425.684)	(0.003)	(0.009)

Last.Name.startK				-1,432.061^***	-1,432.061^***	-0.007^***	-0.011
				(450.390)	(450.390)	(0.003)	(0.010)

Last.Name.startL				-1,856.081^***	-1,856.081^***	-0.009^***	-0.018^*
				(433.357)	(433.357)	(0.003)	(0.009)

Last.Name.startM				-1,486.958^***	-1,486.958^***	-0.007^***	-0.016^*
				(372.459)	(372.459)	(0.002)	(0.008)

Last.Name.startP				-1,218.943^***	-1,218.943^***	-0.007^**	-0.017^*
				(457.532)	(457.532)	(0.003)	(0.010)

Last.Name.startR				24.086	24.086	0.001	0.008
				(472.281)	(472.281)	(0.003)	(0.010)

Last.Name.startS				-1,200.992^***	-1,200.992^***	-0.005^**	-0.005
				(374.686)	(374.686)	(0.002)	(0.008)

Last.Name.startT				-829.469	-829.469	-0.004	-0.026^**
				(514.040)	(514.040)	(0.003)	(0.011)

Last.Name.startW				-1,256.888^***	-1,256.888^***	-0.006^**	-0.014
				(440.781)	(440.781)	(0.003)	(0.010)

Year Dummies?	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Employer Dummies?	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Observations	78,538	78,538	78,538	78,538	78,538	78,538	78,538
R²	0.413	0.413	0.414	0.414	0.414	0.485	0.568
Adjusted R²	0.412	0.412	0.413	0.413	0.413	0.484	0.568
Residual Std. Error	24,518.980	24,518.950	24,499.580	24,495.420	24,495.420	0.144	0.533
F Statistic	587.100^***	563.182^***	554.051^***	508.804^***	508.804^***	676.530^***	946.893^***

Notes:	^***Significant at the 1 percent level.
	^**Significant at the 5 percent level.
	^*Significant at the 10 percent level.
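The coefficients in the log(Salary.Paid) column (6) are in log points. A coefficient b translates into a percentage difference of 100 × (exp(b) - 1), which is close to 100 × b when b is small. The snippet below applies this to three coefficients copied from column (6) of the table above.

```r
# Coefficients copied from column (6), log(Salary.Paid).
b <- c("Associate Professor" = 0.091,
       "Professor"           = 0.241,
       "GenderMale"          = 0.050)

pct <- 100 * (exp(b) - 1)   # percentage difference implied by each coefficient
print(round(pct, 1))
```

So the male coefficient of 0.050 corresponds to roughly 5.1% higher pay, and the Professor coefficient of 0.241 to roughly 27%.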

Here are the residual plots.

Residuals for the model where salary is unchanged.

par(mfrow = c(2,2))
plot(md.all.letters)

Residuals for the model where $100,000 is subtracted from salary. This is because our dataset only includes salaries higher than $100,000 per year.

par(mfrow = c(2,2))
plot(md.all.letters.100k)

Residuals for the model where salary is in the log form.

par(mfrow = c(2,2))
plot(md.all.letters.log)

Residuals for the model where $100,000 is subtracted from salary, and then in log form.

par(mfrow = c(2,2))
plot(md.all.letters.log.100k)

Let’s see whether the effect is generally monotonic, as Professor Marc Bellemare suggested checking.

# Select the last-name-initial coefficients by name rather than by position
idx<-paste0("Last.Name.start",set.n.1)
d.tmp<-data.frame(Last.Name.Initial=set.n.1,
                  Coefficient=coef(md.all.letters)[idx],
                  p.value=summary(md.all.letters)$coefficients[idx,4])
d.tmp$significant10<-"Not Significant"
d.tmp$significant10[d.tmp$p.value<=0.1]<-"Significant at 10%"

ggplot(data = d.tmp,aes(Last.Name.Initial,Coefficient))+
  geom_col(aes(fill =significant10))+
  ylab("Salary Drop Compared to the Reference Last Name Initial ($)")+
  xlab("Last Name Initial")

Compared with employees whose last names start with B, the Cs do not experience a significant salary drop. For the Ds and Fs, however, the drop is about $1,560 and $1,100 per year. The adverse effect is strongest for the Gs, whose salaries are about $3,100 per year lower than those of the Bs. After that, the adverse effect shrinks to roughly $2,500, $2,000, and then $1,500 as the initial moves further away from B, and it eventually levels off at about $1,200 per year. So we could conclude that the adverse last-name effect first increases and then decreases. This makes sense. Think of a case where we need to read something quickly: we would look at the first few pages, skip the middle, and then skim the last few pages. Thus, the adverse effect would be largest when the last name is in the middle.
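One simple way to put a number on “generally monotonic” is a rank correlation between each initial’s position in the alphabet and its estimated coefficient. The sketch below is an addition, not part of the original analysis; it uses the column (4) coefficients copied from the table above. A Spearman correlation near -1 or +1 would indicate a monotonic pattern, while a value closer to 0 would not.

```r
# Column (4) coefficients, relative to the reference initial B.
coefs <- c(C = -44.973, D = -1560.544, F = -1096.777, G = -3093.732,
           H = -2573.824, K = -1432.061, L = -1856.081, M = -1486.958,
           P = -1218.943, R = 24.086, S = -1200.992, T = -829.469,
           W = -1256.888)

position <- match(names(coefs), LETTERS)   # alphabetical position of each initial
rho <- cor(position, coefs, method = "spearman")
print(round(rho, 2))
```

On these thirteen coefficients the correlation is weakly positive (about 0.26), which fits a pattern that dips and then recovers rather than a steady decline.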

How about the last model, where we subtract $100,000 from salary and then take the log? The R² increases to 57%, which suggests a better fit, although R² values are not strictly comparable across models with different dependent variables. Let’s plot the estimated coefficients of the impact of last name initial on salary. It shows a similar pattern.

# Select the last-name-initial coefficients by name rather than by position
idx<-paste0("Last.Name.start",set.n.1)
d.tmp<-data.frame(Last.Name.Initial=set.n.1,
                  Coefficient=c(coef(md.all.letters.log.100k)[idx]),
                  p.value=summary(md.all.letters.log.100k)$coefficients[idx,4])
d.tmp$significant10<-"Not Significant"
d.tmp$significant10[d.tmp$p.value<=0.1]<-"Significant at 10%"

ggplot(data = d.tmp,aes(Last.Name.Initial,Coefficient))+
  geom_col(aes(fill=significant10))+
  ylab("% Drop on (salary - 100k) Compared to the Reference Last Name Initial")+
  xlab("Last Name Initial")

Well, my father may be mad. But next time you see me, Max Shang is no longer my name. I’m Max Aaaa.

The Economics of Family Names

Max Shang Aaaa

University of Guelph, Ridgetown Campus

zshang@uoguelph.ca

Twitter @MaxZYShang

April 7, 2018