1 Question

Is your name related to how much you get paid? How much more is person AAA paid compared to ZZZ?

2 Hypothesis

  • Salary is NOT correlated with the alphabetical order of one’s last name. That is, Professor ABCs are paid about the same as Professor XYZs.

  • Male and female workers are paid the same for the same job.

3 Key findings

  • Salary is significantly related to the alphabetical order of one’s last name. Compared to employees whose last names start with A to E, those with later initials are paid less, by roughly $900 to $1,500 per year (the gap for P to T is smaller and mostly not significant).

  • Female professors are paid about $7,300 less per year than male professors, controlling for job title, employer, and the alphabetical order of the last name.

  • Compared to Assistant Professors, Associate Professors and Professors earn a salary premium of about $12,500 and $36,500 per year, respectively.

More details and my R code follow.

4 Data source

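Ontario Public Sector Salary Disclosure. The individual-level data include:

library(tidyverse)   # read_rds(), dplyr verbs, ggplot2
library(stargazer)   # regression tables
library(lmtest)      # dwtest()
d<-read_rds("pssd_1996_2017.rds")
d$Calendar.Year<-as.factor(d$Calendar.Year)
names(d)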
## [1] "Sector"           "Last.Name"        "First.Name"      
## [4] "Salary.Paid"      "Taxable.Benefits" "Employer"        
## [7] "Job.Title"        "Calendar.Year"

4.1 Data Clean Process

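The data for each year were downloaded and combined. I also generated two new variables, First_Name and Last_Name, which indicate whether the initial of someone’s first name or surname falls in A-E, F-J, K-O, P-T, or T-Z. Because T is listed in both of the last two ranges and case_when() keeps the first match, T is assigned to P-T, so the T-Z group effectively covers U-Z.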

d<-d %>% 
  # First letter of the first name and of the last name
  mutate(First.Name.start=substr(First.Name,1,1),
         Last.Name.start=substr(Last.Name,1,1)) %>% 
  # Keep only records where both initials are capital letters A-Z
  filter(First.Name.start %in% c(LETTERS) &
           Last.Name.start %in% c(LETTERS))     %>% 
  # Group the initials; case_when() uses the first matching condition,
  # so T (letter 20) is assigned to "P-T" and "T-Z" effectively holds U-Z
  mutate(First_Name=case_when(First.Name.start %in% LETTERS[1:5] ~ "A-E",
                           First.Name.start %in% LETTERS[6:10] ~ "F-J",
                           First.Name.start %in% LETTERS[11:15] ~ "K-O",
                           First.Name.start %in% LETTERS[16:20] ~ "P-T",
                           First.Name.start %in% LETTERS[20:26] ~ "T-Z"
                           ))                   %>% 
  mutate(Last_Name=case_when( Last.Name.start %in% LETTERS[1:5] ~ "A-E",
                           Last.Name.start %in% LETTERS[6:10] ~ "F-J",
                           Last.Name.start %in% LETTERS[11:15] ~ "K-O",
                           Last.Name.start %in% LETTERS[16:20] ~ "P-T",
                           Last.Name.start %in% LETTERS[20:26] ~ "T-Z"
  ))  
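In the grouping code, LETTERS[16:20] and LETTERS[20:26] both contain T. As a minimal sketch of a non-overlapping alternative, the hypothetical helper below (group_initial is my own name, not part of the analysis) bins initials with findInterval() and labels the last group U-Z. It is only an illustration and was not used to produce the tables in this post.

```r
# Hypothetical helper (not part of the original analysis): map an initial
# to one of five non-overlapping letter groups.
group_initial <- function(initial) {
  idx <- match(toupper(initial), LETTERS)   # A = 1, ..., Z = 26
  labels <- c("A-E", "F-J", "K-O", "P-T", "U-Z")
  # findInterval() returns the index of the bin each position falls into
  labels[findInterval(idx, c(1, 6, 11, 16, 21))]
}

group_initial(c("A", "E", "F", "T", "U", "Z"))
```

With this version T belongs to P-T by construction, which is also where the case_when() code puts it.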

5 Preliminary Results

This preliminary analysis covers individuals whose job title is Assistant Professor, Associate Professor, or Professor.

Here are the regression results.

Last Name A-E is the reference group. Compared to this group, people whose last names come later in the alphabet are paid less.

Column (2) adds the initial of the first name as an explanatory variable. It is not significant, but the initial of the last name remains significant.
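For readers who prefer an equation, the model in column (1) can be sketched as follows. The symbols are my own notation and do not appear in the R output; each sum leaves out one reference category (A-E for the last name groups).

```latex
\text{Salary}_{i} = \beta_0
  + \sum_{g \neq \text{A-E}} \gamma_g \,\mathbf{1}[\text{LastName}_i \in g]
  + \sum_{j} \delta_j \,\mathbf{1}[\text{JobTitle}_i = j]
  + \sum_{e} \theta_e \,\mathbf{1}[\text{Employer}_i = e]
  + \sum_{t} \lambda_t \,\mathbf{1}[\text{Year}_i = t]
  + \varepsilon_i
```

Each \gamma_g is the salary difference between last name group g and the A-E group, holding job title, employer, and year fixed. Column (2) adds an analogous set of terms for the first name groups.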

d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year
)

md.add.first.name <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         First_Name+
         Job.Title+
         Employer+
         Calendar.Year)
         
stargazer(md,md.add.first.name,
          omit = "Employer",
          omit.labels = "Employer",
          intercept.bottom = F,
          df = FALSE,
          type="html")
Dependent variable: Salary.Paid
                                           (1)             (2)
--------------------------------------------------------------
Constant                         58,902.860***   58,818.970***
                                   (3,413.797)     (3,416.030)
Last_NameF-J                     -1,386.548***   -1,385.900***
                                     (227.773)       (227.823)
Last_NameK-O                       -828.753***     -829.437***
                                     (207.007)       (207.074)
Last_NameP-T                          -322.248        -320.559
                                     (211.311)       (211.362)
Last_NameT-Z                     -1,108.991***   -1,106.210***
                                     (276.599)       (276.812)
First_NameF-J                                          120.286
                                                     (211.740)
First_NameK-O                                           28.464
                                                     (209.901)
First_NameP-T                                          206.241
                                                     (204.896)
First_NameT-Z                                          -16.824
                                                     (341.493)
Job.TitleAssociate Professor     11,846.280***   11,844.550***
                                     (276.308)       (276.334)
Job.TitleProfessor               34,952.760***   34,947.670***
                                     (286.153)       (286.279)
Calendar.Year1998                    4,219.026       4,233.896
                                   (3,049.475)     (3,049.597)
Calendar.Year1999                    3,905.832       3,919.749
                                   (2,997.917)     (2,998.025)
Calendar.Year2000                   4,869.895*      4,882.420*
                                   (2,914.033)     (2,914.176)
Calendar.Year2001                  6,515.093**     6,524.331**
                                   (2,911.235)     (2,911.369)
Calendar.Year2002                 7,750.415***    7,759.718***
                                   (2,882.859)     (2,882.988)
Calendar.Year2003                12,559.320***   12,568.870***
                                   (2,848.803)     (2,848.949)
Calendar.Year2004                 7,839.397***    7,852.529***
                                   (2,847.962)     (2,848.176)
Calendar.Year2005                19,297.630***   19,311.460***
                                   (2,826.526)     (2,826.723)
Calendar.Year2006                21,915.240***   21,929.960***
                                   (2,826.652)     (2,826.863)
Calendar.Year2007                28,603.920***   28,617.230***
                                   (2,819.952)     (2,820.166)
Calendar.Year2008                32,577.590***   32,590.420***
                                   (2,816.833)     (2,817.037)
Calendar.Year2009                37,890.150***   37,904.000***
                                   (2,812.323)     (2,812.534)
Calendar.Year2010                41,064.560***   41,077.630***
                                   (2,810.306)     (2,810.519)
Calendar.Year2011                43,725.490***   43,739.210***
                                   (2,807.960)     (2,808.177)
Calendar.Year2012                46,419.780***   46,433.590***
                                   (2,807.480)     (2,807.701)
Calendar.Year2013                48,718.290***   48,732.480***
                                   (2,808.908)     (2,809.139)
Calendar.Year2014                51,732.380***   51,746.830***
                                   (2,807.829)     (2,808.066)
Calendar.Year2015                53,331.590***   53,346.510***
                                   (2,805.817)     (2,806.055)
Calendar.Year2016                56,443.620***   56,458.290***
                                   (2,805.224)     (2,805.464)
Calendar.Year2017                59,664.220***   59,679.830***
                                   (2,808.626)     (2,808.864)
--------------------------------------------------------------
Employer                                   Yes             Yes
Observations                            94,051          94,051
R2                                       0.463           0.463
Adjusted R2                              0.462           0.462
Residual Std. Error                 22,465.490      22,465.810
F Statistic                         715.326***      690.862***
--------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

6 Testing for violations of model assumptions

  • Errors should follow a normal distribution

The normal Q-Q plot does not look linear, so the errors do not appear to follow a normal distribution.

par(mfrow = c(2,2))
plot(md)

  • No autocorrelation

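The D-W statistic (1.89) is close to 2, which suggests that any autocorrelation is weak. Note, however, that with 94,051 observations the test still rejects the null hypothesis of no positive autocorrelation (p-value < 2.2e-16), so this assumption is only approximately satisfied.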

dwtest(md)
## 
##  Durbin-Watson test
## 
## data:  md
## DW = 1.8915, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
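To make the DW number less of a black box, here is a minimal sketch that computes the Durbin-Watson statistic by hand from the residuals of a toy regression. The data are simulated and purely hypothetical; the point is only the formula, which compares each residual with the one before it.

```r
# Toy data (hypothetical, for illustration only)
set.seed(1)
x <- 1:200
y <- 2 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

e <- residuals(fit)
# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The statistic lies between 0 and 4, with values near 2 indicating little first-order autocorrelation. It depends on the order of the rows, so it is most meaningful when the observations have a natural ordering.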

7 Why so?

Here is some related research on the last name premium:

van Praag, M., & van Praag, B. M. (2004). First-Author Determinants and the Benefits of Being Professor A (and not Z).

“We find that author groups clearly account for the relative individual costs and benefits of being a first author when positioning authors of a group: Economists perceive name ordering as a deliberate decision. The usefulness of this deliberation is supported by the answer to our second question: is scientific output higher for authors whose names rank first in the alphabet? Career prospects are indeed better for academic economists who have high chances to be a first author. This productivity effect does not become visible at the beginning of one’s career: it turns out to be a reputation or visibility effect and it therefore only shows up when an economist’s career is more advanced.”

Efthyvoulou, G. (2008). Alphabet economics: The link between names and reputation. The Journal of Socio-Economics, 37(3), 1266-1285.

“Overall, we find that faculty members with earlier last name initials are more likely to get employment at high standard research departments. Furthermore, we show that the relationship between alphabetical placement and academic success remains significant if we use as an alternative measure of reputation the number of people showing an interest in the papers of a particular academic.”

8 Adjust salary by CPI

The previous results do not adjust salary by CPI. I will do it here to see if that affects the main findings. All salaries are adjusted to the 2017 level.

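cpi<-read.csv("CanadaCPI.csv")
# Extract the year from the first four characters of the Quarter column
cpi$year<-as.numeric(substr(cpi$Quarter,1,4))
# Average the inflation figures within each year
d.cpi<-data.frame(year=1993:2017,CPI=c(by(cpi$CPI.Inflation,cpi$year,mean)))
# Turn a percentage rate into a growth factor, e.g. 2 becomes 1.02
d.cpi$pct<-(d.cpi$CPI/100+1)

# Cumulative factor that scales each year's dollars to the 2017 level
d.cpi$to2017<-1
for (i in 1993:2016) {
d.cpi$to2017[d.cpi$year==i] <- prod(d.cpi$pct[d.cpi$year %in% i:2017])
}

# Attach the factor to every salary record by year
d.merge<-merge(x = d,y = d.cpi,by.x ="Calendar.Year", by.y = "year",all.x = T )

# Express salary and benefits in 2017 dollars
d<-mutate(d.merge,
          Salary.Paid=Salary.Paid*to2017,
          Taxable.Benefits=Taxable.Benefits*to2017
          )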
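The compounding step is easier to follow on a tiny example. The sketch below uses made-up inflation rates (not the real CPI series) and builds the factor that carries a dollar from each year to 2017. Note that it compounds only the years after the salary year, whereas the loop above also includes the salary year's own rate; which convention is appropriate depends on how CPI.Inflation is defined in CanadaCPI.csv, which I am treating as an open assumption here.

```r
# Hypothetical annual inflation rates in percent (not the real CPI series)
infl <- data.frame(year = 2014:2017, rate = c(2.0, 1.0, 1.5, 1.6))
infl$pct <- 1 + infl$rate / 100

# Factor that carries a dollar from `year` to 2017, compounding the
# inflation of every later year (year + 1, ..., 2017)
infl$to2017 <- sapply(infl$year, function(y) prod(infl$pct[infl$year > y]))
infl
```

For 2017 the product is taken over no years at all, so the factor is 1, and for 2016 it is just the 2017 growth factor.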

Here are the results with inflation-adjusted salary.

d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year
)

md.add.first.name <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         First_Name+
         Job.Title+
         Employer+
         Calendar.Year)
         
stargazer(md,md.add.first.name,
          omit = "Employer",
          omit.labels = "Employer",
          intercept.bottom = F,
          df = FALSE,
          type="html")
Dependent variable: Salary.Paid
                                           (1)             (2)
--------------------------------------------------------------
Constant                        108,605.600***  108,536.000***
                                   (3,703.862)     (3,706.282)
Last_NameF-J                     -1,492.545***   -1,491.028***
                                     (247.126)       (247.180)
Last_NameK-O                       -886.264***     -885.580***
                                     (224.596)       (224.668)
Last_NameP-T                          -342.091        -339.603
                                     (229.265)       (229.321)
Last_NameT-Z                     -1,223.407***   -1,218.265***
                                     (300.101)       (300.332)
First_NameF-J                                          124.782
                                                     (229.731)
First_NameK-O                                          -19.573
                                                     (227.736)
First_NameP-T                                          190.556
                                                     (222.306)
First_NameT-Z                                         -107.331
                                                     (370.509)
Job.TitleAssociate Professor     12,431.790***   12,429.340***
                                     (299.786)       (299.813)
Job.TitleProfessor               37,403.640***   37,397.000***
                                     (310.467)       (310.603)
Calendar.Year1998                       50.616          67.260
                                   (3,308.584)     (3,308.714)
Calendar.Year1999                   -1,821.808      -1,806.956
                                   (3,252.645)     (3,252.761)
Calendar.Year2000                   -3,309.803      -3,296.403
                                   (3,161.633)     (3,161.788)
Calendar.Year2001                  -5,453.563*     -5,442.926*
                                   (3,158.598)     (3,158.742)
Calendar.Year2002                 -7,664.755**    -7,654.777**
                                   (3,127.811)     (3,127.949)
Calendar.Year2003                  -5,211.303*     -5,200.639*
                                   (3,090.861)     (3,091.018)
Calendar.Year2004               -13,498.750***  -13,484.050***
                                   (3,089.949)     (3,090.179)
Calendar.Year2005                   -3,682.918      -3,667.822
                                   (3,066.692)     (3,066.904)
Calendar.Year2006                   -3,756.728      -3,740.102
                                   (3,066.829)     (3,067.055)
Calendar.Year2007                      959.974         975.481
                                   (3,059.558)     (3,059.789)
Calendar.Year2008                    2,277.027       2,291.945
                                   (3,056.175)     (3,056.395)
Calendar.Year2009                    4,521.858       4,538.341
                                   (3,051.282)     (3,051.509)
Calendar.Year2010                  7,640.857**     7,656.226**
                                   (3,049.093)     (3,049.322)
Calendar.Year2011                 7,848.134***    7,864.311***
                                   (3,046.548)     (3,046.781)
Calendar.Year2012                  6,795.538**     6,811.963**
                                   (3,046.027)     (3,046.265)
Calendar.Year2013                  7,100.251**     7,117.161**
                                   (3,047.576)     (3,047.825)
Calendar.Year2014                 9,068.066***    9,085.208***
                                   (3,046.406)     (3,046.661)
Calendar.Year2015                 8,043.251***    8,061.040***
                                   (3,044.223)     (3,044.479)
Calendar.Year2016                 9,684.371***    9,701.781***
                                   (3,043.579)     (3,043.839)
Calendar.Year2017                 8,131.262***    8,149.733***
                                   (3,047.270)     (3,047.527)
--------------------------------------------------------------
Employer                                   Yes             Yes
Observations                            94,051          94,051
R2                                       0.418           0.418
Adjusted R2                              0.417           0.417
Residual Std. Error                 24,374.350      24,374.680
F Statistic                         596.978***      576.565***
--------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

9 Male Names vs. Female Names

It would be interesting to see how gender is related to salary. The Ontario public sector salary data do not record gender, but they do include names. So I matched the first names against the top 100 male names and top 100 female names in Canada according to this website. For names on neither list, the gender variable is set to “Gender Unknown”. A name that appears on both lists is coded as male, because the male list is checked first. I hope there are not many males using female names or vice versa.

top100names<-read.csv("top100names.csv",colClasses = c("integer","character","character"))
top100names<-top100names %>% 
mutate(Boy = tolower(Boy)) %>% 
  mutate(Girl = tolower(Girl)) 
set.top.male.name<-c(top100names$Boy)  
set.top.female.name<-c(top100names$Girl)  




d <-d %>% 
  mutate(First.Name = tolower(First.Name)) %>% 
  mutate(Last.Name = tolower(Last.Name)) %>% 
  mutate(top_100_male_name = case_when(First.Name %in%  set.top.male.name~ "YES",
                                       TRUE ~ "NO")) %>% 
  mutate(top_100_female_name = case_when(First.Name %in% set.top.female.name ~ "YES",
                                       TRUE ~ "NO")) %>% 
  mutate(Gender = case_when(top_100_male_name == "YES" ~ "Male",
                          top_100_female_name == "YES" ~ "Female",
                                       TRUE ~ "Gender Unknown"))
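Here is a minimal base R sketch of the same matching logic on a few made-up names, to show how the precedence works. The two short name vectors and the guess_gender() helper are hypothetical stand-ins for top100names.csv and the case_when() call above.

```r
# Hypothetical name lists standing in for top100names.csv
male_names   <- c("liam", "noah", "jordan")
female_names <- c("emma", "olivia", "jordan")

guess_gender <- function(first_name) {
  nm <- tolower(first_name)
  # The male list is checked first, so a name on both lists becomes "Male"
  ifelse(nm %in% male_names, "Male",
         ifelse(nm %in% female_names, "Female", "Gender Unknown"))
}

guess_gender(c("Liam", "Emma", "Jordan", "Xiaoming", "Mary Ann"))
```

Because the match is exact, a first-name field holding more than one word (such as "Mary Ann" here) falls into "Gender Unknown" even if one of its parts is on a list.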

9.1 Plots

Let’s plot the data.

d.sub<-subset(d, Job.Title %in% c("Assistant Professor", "Associate Professor", "Professor"))
ggplot(d.sub,aes(Job.Title,Salary.Paid))+
  geom_boxplot()

ggplot(d.sub,aes(Gender,Salary.Paid))+
  geom_boxplot()

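# If Calendar.Year is a factor, as.numeric() alone would return level codes
# (1, 2, ...), so convert through as.character() to recover the actual years
d.sub$Calendar.Year<-as.numeric(as.character(d.sub$Calendar.Year))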
ggplot(d.sub,aes(Calendar.Year,Salary.Paid,color=Gender))+
  geom_point(alpha=0.05)+
  stat_smooth(size=1)+
  facet_grid(~Job.Title)+
  theme(legend.position="bottom")
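One thing to watch when a year column has been stored as a factor (Calendar.Year was converted with as.factor() earlier): calling as.numeric() directly on a factor returns the internal level codes, not the years. A small sketch with three made-up values shows the difference.

```r
year_factor <- factor(c("1997", "2005", "2017"))

codes <- as.numeric(year_factor)                # level codes: 1 2 3
years <- as.numeric(as.character(year_factor))  # years: 1997 2005 2017

codes
years
```

With level codes on the x-axis the shape of the trend is unchanged, but the axis would read 1, 2, 3, ... instead of calendar years.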

9.2 Results

Here are the regression results. Column (3) adds gender, with female as the reference group. The results suggest that males are paid $7,315 more per year than females after controlling for the alphabetical order of last names, job title, employer, and year. Of course, these results are preliminary, and the model explains only 41% of the variation in salary. Are women doing the same job as men but getting paid less? If so, then from an employer’s point of view it would make sense to hire more women, which would raise the demand for female workers. As employers compete for female workers by offering higher wages, female salaries would be pushed up. So what is causing this $7,315 per year salary gap between males and females?

Dependent variable: Salary.Paid
                                           (1)             (2)             (3)
------------------------------------------------------------------------------
Constant                         98,243.520***   98,164.910***   92,131.070***
                                   (2,168.682)     (2,171.705)     (2,245.461)
Last_NameF-J                     -1,498.886***   -1,498.243***   -1,511.285***
                                     (248.313)       (248.367)       (248.232)
Last_NameK-O                       -904.142***     -904.453***     -897.833***
                                     (225.684)       (225.757)       (225.616)
Last_NameP-T                          -358.774        -356.869       -379.615*
                                     (230.371)       (230.426)       (230.289)
Last_NameT-Z                     -1,229.256***   -1,225.713***   -1,248.908***
                                     (301.554)       (301.786)       (301.622)
Job.TitleAssociate Professor     12,415.990***   12,413.560***   12,413.890***
                                     (301.182)       (301.209)       (300.993)
Job.TitleProfessor               36,547.560***   36,540.660***   36,489.230***
                                     (310.130)       (310.272)       (310.084)
First_NameF-J                                          162.720         -50.453
                                                     (230.824)       (231.382)
First_NameK-O                                           11.193         -43.516
                                                     (228.834)       (229.344)
First_NameP-T                                          212.680         220.567
                                                     (223.380)       (225.648)
First_NameT-Z                                          -62.995        -230.031
                                                     (372.294)       (372.302)
GenderGender Unknown                                              5,798.240***
                                                                     (585.245)
GenderMale                                                        7,315.277***
                                                                     (629.987)
------------------------------------------------------------------------------
Year Dummies?                              Yes             Yes             Yes
Employer Dummies?                          Yes             Yes             Yes
Observations                            94,051          94,051          94,051
R2                                       0.412           0.412           0.413
Adjusted R2                              0.412           0.412           0.412
Residual Std. Error                 24,493.210      24,493.540      24,475.970
F Statistic                         700.818***      672.210***      661.081***
------------------------------------------------------------------------------
Notes: ***Significant at the 1 percent level.
       **Significant at the 5 percent level.
       *Significant at the 10 percent level.

10 A dummy for each last name initial

One of my favorite economists, Professor Marc F. Bellemare, suggests:

one dummy for each letter and see if it’s generally monotonic.

Great suggestion! Let’s generate one dummy for each last name initial and repeat our main model. The results are reported in column (4).
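In R, "one dummy for each letter" does not require creating 25 columns by hand: putting a factor into a model formula does it automatically. A minimal sketch with a made-up factor of three initials shows what lm() builds behind the scenes.

```r
# Hypothetical initials, for illustration only
initials <- factor(c("A", "B", "C", "B", "A"))

# model.matrix() expands a factor into 0/1 columns, dropping the first
# level ("A") as the reference group
mm <- model.matrix(~ initials)
mm
```

The first level is absorbed into the intercept, so each remaining column measures the difference from that reference letter.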

Dependent variable: Salary.Paid
                                           (1)             (2)             (3)             (4)
----------------------------------------------------------------------------------------------
Constant                         98,243.520***   98,164.910***   92,131.070***   90,970.680***
                                   (2,168.682)     (2,171.705)     (2,245.461)     (2,279.831)
Last_NameF-J                     -1,498.886***   -1,498.243***   -1,511.285***
                                     (248.313)       (248.367)       (248.232)
Last_NameK-O                       -904.142***     -904.453***     -897.833***
                                     (225.684)       (225.757)       (225.616)
Last_NameP-T                          -358.774        -356.869       -379.615*
                                     (230.371)       (230.426)       (230.289)
Last_NameT-Z                     -1,229.256***   -1,225.713***   -1,248.908***
                                     (301.554)       (301.786)       (301.622)
Job.TitleAssociate Professor     12,415.990***   12,413.560***   12,413.890***   12,413.980***
                                     (301.182)       (301.209)       (300.993)       (300.847)
Job.TitleProfessor               36,547.560***   36,540.660***   36,489.230***   36,495.860***
                                     (310.130)       (310.272)       (310.084)       (309.965)
First_NameF-J                                          162.720         -50.453         -20.195
                                                     (230.824)       (231.382)       (231.421)
First_NameK-O                                           11.193         -43.516         -35.168
                                                     (228.834)       (229.344)       (229.596)
First_NameP-T                                          212.680         220.567         250.036
                                                     (223.380)       (225.648)       (225.928)
First_NameT-Z                                          -62.995        -230.031         -71.392
                                                     (372.294)       (372.302)       (373.578)
GenderGender Unknown                                              5,798.240***    5,781.004***
                                                                     (585.245)       (585.309)
GenderMale                                                        7,315.277***    7,269.344***
                                                                     (629.987)       (630.034)
Last.Name.startB                                                                  1,671.008***
                                                                                     (503.257)
Last.Name.startC                                                                  1,647.468***
                                                                                     (518.380)
Last.Name.startD                                                                        84.652
                                                                                     (553.682)
Last.Name.startE                                                                    1,349.730*
                                                                                     (738.823)
Last.Name.startF                                                                       573.804
                                                                                     (609.052)
Last.Name.startG                                                                  -1,397.618**
                                                                                     (553.792)
Last.Name.startH                                                                      -872.340
                                                                                     (532.563)
Last.Name.startI                                                                       861.369
                                                                                   (1,021.664)
Last.Name.startJ                                                                    1,126.054*
                                                                                     (674.941)
Last.Name.startK                                                                       230.385
                                                                                     (552.170)
Last.Name.startL                                                                      -163.566
                                                                                     (537.964)
Last.Name.startM                                                                       187.433
                                                                                     (491.053)
Last.Name.startN                                                                      -105.547
                                                                                     (690.373)
Last.Name.startO                                                                  2,136.139***
                                                                                     (783.196)
Last.Name.startP                                                                       447.292
                                                                                     (557.602)
Last.Name.startQ                                                                   -3,299.892*
                                                                                   (1,689.085)
Last.Name.startR                                                                  1,712.001***
                                                                                     (570.085)
Last.Name.startS                                                                       474.608
                                                                                     (492.850)
Last.Name.startT                                                                       857.377
                                                                                     (604.445)
Last.Name.startU                                                                 -8,422.304***
                                                                                   (1,815.731)
Last.Name.startV                                                                      -534.440
                                                                                     (739.546)
Last.Name.startW                                                                       438.395
                                                                                     (544.148)
Last.Name.startX                                                                -10,776.280***
                                                                                   (1,753.550)
Last.Name.startY                                                                        76.149
                                                                                     (884.721)
Last.Name.startZ                                                                     1,052.583
                                                                                     (880.917)
----------------------------------------------------------------------------------------------
Year Dummies?                              Yes             Yes             Yes             Yes
Employer Dummies?                          Yes             Yes             Yes             Yes
Observations                            94,051          94,051          94,051          94,051
R2                                       0.412           0.412           0.413           0.414
Adjusted R2                              0.412           0.412           0.412           0.413
Residual Std. Error                 24,493.210      24,493.540      24,475.970      24,460.630
F Statistic                         700.818***      672.210***      661.081***      548.181***
----------------------------------------------------------------------------------------------
Notes: ***Significant at the 1 percent level.
       **Significant at the 5 percent level.
       *Significant at the 10 percent level.

10.1 Rare last name initials late in the alphabet may coincide with race

Let’s look at the distribution of the last name initials.

plot(table(d$Last.Name.start),xlab="Last Name Initial", ylab="Number of Observations")

Sort the last name initials by the number of observations.

sort(table(d$Last.Name.start), decreasing = T)
## 
##      M      S      B      C      H      L      D      P      G      W      R      K      F 
## 128684 106039 101749  89644  67218  65279  63820  59601  57196  55843  53884  49970  41377 
##      T      A      J      N      V      E      O      Y      Z      I      Q      U      X 
##  41136  38861  24291  21905  19934  17830  16360   8945   8832   7397   2888   2309    702 

Some last name initials (e.g. “U” and “X”) are rare and may coincide with the employee’s race and cultural background. This could bias the model, because we cannot control for employees’ race (race is not available in our dataset).

We can therefore restrict the sample to the more common last name initials, which are less likely to be dominated by a minority group, and see whether our findings remain valid.

10.2 Last name with more observations

Select the most common last name initials (the 14 with the most observations). They are:

set<-names(sort(table(d$Last.Name.start), decreasing = T))[1:14]

set.n.1<-sort(set,decreasing = F)[2:length(set)]
set
##  [1] "M" "S" "B" "C" "H" "L" "D" "P" "G" "W" "R" "K" "F" "T"

Here are the results. It appears that our findings remain valid. Your salary is related to the alphabetical order of your last name.

d.sub2<-subset(d.sub,Last.Name.start %in% set)

d.sub<-d.sub2
md <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year)

md.first.name <- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year+
         First_Name)

md.gender<- lm(data = d.sub,
         Salary.Paid ~ Last_Name+
         Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
          Gender )

md.all.letters<- lm(data = d.sub,
         Salary.Paid ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)

md.all.letters.100k<- lm(data = d.sub,
         (Salary.Paid-100000) ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)

md.all.letters.log<- lm(data = d.sub,
         log(Salary.Paid) ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)

md.all.letters.log.100k<- lm(data = d.sub,
         log(Salary.Paid-100000) ~Job.Title+
         Employer+
         Calendar.Year+
         First_Name+
         Gender+
         Last.Name.start)


stargazer(md,
          md.first.name,
          md.gender,
          md.all.letters,
          md.all.letters.100k,
          md.all.letters.log,
          md.all.letters.log.100k,
          omit = c("Calendar.Year","Employer"),
          omit.labels = c("Year Dummies?","Employer Dummies?"),
          df = FALSE,
          type="html",
          intercept.bottom = F,
          style = "aer",covariate.labels = c())
Dependent variable: (1)-(4) Salary.Paid; (5) (Salary.Paid - 1e+05); (6) log(Salary.Paid); (7) log(Salary.Paid - 1e+05)
                                           (1)             (2)             (3)             (4)             (5)             (6)             (7)
----------------------------------------------------------------------------------------------------------------------------------------------
Constant                        100,290.400***  100,151.500***   93,735.960***   93,971.180***    -6,028.819**       11.543***        9.517***
                                   (2,543.269)     (2,546.127)     (2,621.864)     (2,628.167)     (2,628.167)         (0.015)         (0.057)
Last_NameF-J                     -2,000.986***   -2,010.717***   -2,018.406***
                                     (275.956)       (276.000)       (275.851)
Last_NameK-O                     -1,184.096***   -1,195.836***   -1,180.312***
                                     (249.172)       (249.248)       (249.104)
Last_NameP-T                        -492.425**      -501.543**      -516.948**
                                     (244.961)       (245.020)       (244.880)
Last_NameT-Z                        -828.129**      -841.560**      -862.727**
                                     (387.678)       (387.841)       (387.555)
Job.TitleAssociate Professor     12,203.100***   12,202.730***   12,199.920***   12,203.580***   12,203.580***        0.091***        0.455***
                                     (331.275)       (331.317)       (331.055)       (331.016)       (331.016)         (0.002)         (0.007)
Job.TitleProfessor               36,450.200***   36,448.150***   36,385.650***   36,379.050***   36,379.050***        0.241***        0.911***
                                     (341.119)       (341.320)       (341.096)       (341.089)       (341.089)         (0.002)         (0.007)
First_NameF-J                                          288.581          58.985          64.043          64.043          -0.000          -0.002
                                                     (252.828)       (253.455)       (253.523)       (253.523)         (0.001)         (0.006)
First_NameK-O                                          245.800         176.575         173.145         173.145           0.001           0.003
                                                     (251.750)       (252.259)       (252.514)       (252.514)         (0.001)         (0.005)
First_NameP-T                                           10.766          18.393          33.301          33.301           0.000           0.005
                                                     (244.029)       (246.469)       (246.635)       (246.635)         (0.001)         (0.005)
First_NameT-Z                                         682.197*         501.337         530.777         530.777          0.004*         0.020**
                                                     (409.707)       (409.703)       (410.345)       (410.345)         (0.002)         (0.009)
GenderGender Unknown                                              6,174.504***    6,150.345***    6,150.345***        0.040***        0.131***
                                                                     (645.630)       (645.710)       (645.710)         (0.004)         (0.014)
GenderMale                                                        7,721.053***    7,703.478***    7,703.478***        0.050***        0.160***
                                                                     (692.797)       (692.937)       (692.937)         (0.004)         (0.015)
Last.Name.startC                                                                       -44.973         -44.973           0.000           0.007
                                                                                     (408.275)       (408.275)         (0.002)         (0.009)
Last.Name.startD                                                                 -1,560.544***   -1,560.544***       -0.009***       -0.028***
                                                                                     (451.858)       (451.858)         (0.003)         (0.010)
Last.Name.startF                                                                  -1,096.777**    -1,096.777**        -0.006**        -0.023**
                                                                                     (518.897)       (518.897)         (0.003)         (0.011)
Last.Name.startG                                                                 -3,093.732***   -3,093.732***       -0.017***       -0.043***
                                                                                     (452.659)       (452.659)         (0.003)         (0.010)
Last.Name.startH                                                                 -2,573.824***   -2,573.824***       -0.014***       -0.038***
                                                                                     (425.684)       (425.684)         (0.003)         (0.009)
Last.Name.startK                                                                 -1,432.061***   -1,432.061***       -0.007***          -0.011
                                                                                     (450.390)       (450.390)         (0.003)         (0.010)
Last.Name.startL                                                                 -1,856.081***   -1,856.081***       -0.009***         -0.018*
                                                                                     (433.357)       (433.357)         (0.003)         (0.009)
Last.Name.startM                                                                 -1,486.958***   -1,486.958***       -0.007***         -0.016*
                                                                                     (372.459)       (372.459)         (0.002)         (0.008)
Last.Name.startP                                                                 -1,218.943***   -1,218.943***        -0.007**         -0.017*
                                                                                     (457.532)       (457.532)         (0.003)         (0.010)
Last.Name.startR                                                                        24.086          24.086           0.001           0.008
                                                                                     (472.281)       (472.281)         (0.003)         (0.010)
Last.Name.startS                                                                 -1,200.992***   -1,200.992***        -0.005**          -0.005
                                                                                     (374.686)       (374.686)         (0.002)         (0.008)
Last.Name.startT                                                                      -829.469        -829.469          -0.004        -0.026**
                                                                                     (514.040)       (514.040)         (0.003)         (0.011)
Last.Name.startW                                                                 -1,256.888***   -1,256.888***        -0.006**          -0.014
                                                                                     (440.781)       (440.781)         (0.003)         (0.010)
----------------------------------------------------------------------------------------------------------------------------------------------
Year Dummies?                              Yes             Yes             Yes             Yes             Yes             Yes             Yes
Employer Dummies?                          Yes             Yes             Yes             Yes             Yes             Yes             Yes
Observations                            78,538          78,538          78,538          78,538          78,538          78,538          78,538
R2                                       0.413           0.413           0.414           0.414           0.414           0.485           0.568
Adjusted R2                              0.412           0.412           0.413           0.413           0.413           0.484           0.568
Residual Std. Error                 24,518.980      24,518.950      24,499.580      24,495.420      24,495.420           0.144           0.533
F Statistic                         587.100***      563.182***      554.051***      508.804***      508.804***      676.530***      946.893***
----------------------------------------------------------------------------------------------------------------------------------------------
Notes: ***Significant at the 1 percent level.
       **Significant at the 5 percent level.
       *Significant at the 10 percent level.

Here are the residual plots.

Residuals for the model where salary is unchanged.

par(mfrow = c(2,2))
plot(md.all.letters)

Residuals for the model where $100,000 is subtracted from salary. This is because our dataset only includes salaries higher than $100,000 per year.

par(mfrow = c(2,2))
plot(md.all.letters.100k)
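Subtracting a constant from the dependent variable only moves the intercept; the slopes, standard errors, and residuals stay the same, which is why columns (4) and (5) of the table agree on everything except the constant. A minimal sketch on simulated (hypothetical) data shows this.

```r
# Hypothetical data, for illustration only
set.seed(42)
x <- rnorm(100)
salary <- 120000 + 5000 * x + rnorm(100, sd = 1000)

fit_raw     <- lm(salary ~ x)
fit_shifted <- lm(I(salary - 100000) ~ x)

coef(fit_raw)
coef(fit_shifted)
```

The two intercepts differ by exactly 100,000 and the slope on x is identical, so the residual plots for the shifted model carry no new information relative to the unshifted one.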

Residuals for the model where salary is in the log form.

par(mfrow = c(2,2))
plot(md.all.letters.log)

Residuals for the model where $100,000 is subtracted from salary, and then in log form.

par(mfrow = c(2,2))
plot(md.all.letters.log.100k)

Let’s see whether the effect, as hypothesized by Professor Marc Bellemare, would be generally monotonic.

d.tmp<-data.frame(Last.Name.Initial=set.n.1,
                  Coefficient=coef(md.all.letters)[98:(length(set.n.1)+98-1)],
                  p.value=summary(md.all.letters)$coefficients[98:(length(set.n.1)+98-1),4])
d.tmp$significant10<-"Not Significant"
d.tmp$significant10[d.tmp$p.value<=0.1]<-"Significant at 10%"

ggplot(data = d.tmp,aes(Last.Name.Initial,Coefficient))+
  geom_col(aes(fill =significant10))+
  ylab("Salary Drop Compared to the Reference Last Name Initial ($)")+
  xlab("Last Name Initial")
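The code above picks the last name coefficients by position (starting at 98), which only works as long as the number of employer and year dummies does not change. A more robust option, sketched below on a made-up toy model, is to select the rows by name. The toy data frame and its column names are hypothetical; the idea carries over by matching on "^Last.Name.start" instead.

```r
# Hypothetical toy data; `initial` stands in for Last.Name.start
set.seed(7)
toy <- data.frame(
  initial = factor(rep(c("B", "C", "D"), times = 20)),
  rank    = factor(rep(c("Assistant", "Professor"), times = 30))
)
toy$salary <- 100 + 5 * (toy$rank == "Professor") -
  2 * (toy$initial == "D") + rnorm(60)

fit <- lm(salary ~ rank + initial, data = toy)
tab <- summary(fit)$coefficients

# Select rows by coefficient name rather than by position
rows <- grep("^initial", rownames(tab))
out <- data.frame(
  initial     = sub("^initial", "", rownames(tab)[rows]),
  coefficient = tab[rows, "Estimate"],
  p.value     = tab[rows, "Pr(>|t|)"]
)
out
```

Taking both the estimate and the p-value from the same summary table also keeps the two columns aligned; as far as I know, coef() keeps coefficients that could not be estimated (as NA) while the summary table drops them, so positional indexes into the two can drift apart.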

Compared to employees whose last names start with B, employee Cs do not experience a significant salary drop. However, for employee Ds and Fs, the salary drop is $1,560 and $1,096 per year. This adverse effect strengthens for Professor Gs, whose salary is $3,093 per year lower than that of Bs. After that, the adverse effect shrinks to roughly $2,500, $2,000, and then $1,500 as the last names start further away from B. Eventually, the adverse last name effect settles at about $1,200 per year. So we could conclude that the adverse last name effect first increases and then decreases. This makes sense. Think of a case where we need to read something quickly: we would look at the first few pages, skip the middle, and then skim the last few pages. Thus, the adverse effect would be largest if the last name is in the middle.

How about the last model, where we subtracted $100,000 from the salary variable and then took the log? The R-squared rises to 57%, which hints at a better fit, although R-squared values are not directly comparable across models with different dependent variables. Let’s plot the estimated coefficients of the impact of last name initial on salary. It shows a similar pattern.

d.tmp<-data.frame(Last.Name.Initial=set.n.1,
                  Coefficient=c(coef(md.all.letters.log.100k)[98:(length(set.n.1)+98-1)]),
                  p.value=summary(md.all.letters.log.100k)$coefficients[98:(length(set.n.1)+98-1),4])
d.tmp$significant10<-"Not Significant"
d.tmp$significant10[d.tmp$p.value<=0.1]<-"Significant at 10%"

ggplot(data = d.tmp,aes(Last.Name.Initial,Coefficient))+
  geom_col(aes(fill=significant10))+
  ylab("% Drop on (salary - 100k) Compared to the Reference Last Name Initial")+
  xlab("Last Name Initial")
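A note on reading the log models: a coefficient b on a dummy corresponds to a 100 * (exp(b) - 1) percent change in the dependent variable, which is close to 100 * b when b is small. As a quick sketch, here is the conversion for the Last.Name.startG coefficient in column (7), where the dependent variable is the salary above $100,000 rather than the salary itself.

```r
# Coefficient on Last.Name.startG in column (7), log(Salary.Paid - 1e+05)
b_g <- -0.043

# Exact percentage change implied by a log-model dummy coefficient
pct_change <- 100 * (exp(b_g) - 1)
pct_change
```

The result is slightly smaller in magnitude than the naive 4.3%, and the gap between the two grows as coefficients get larger.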

Well, my father may be mad. But next time you see me, Max Shang is no longer my name. I’m Max Aaaa.