Empirical Strategy

Model One Simple Regression

I am trying to estimate what impacts the percentage of employees who participate in a company’s 401k plan.

Using a simple linear regression, I predict the percentage of employees who participate in the 401k plan from the rate at which the employer matches employee contributions to the plan. I choose this form because employees should be more likely to participate in a plan to which the employer contributes more.

\(prate_{i} = \beta_{0} + \beta_{1}mrate_{i} + \epsilon_{i}\)
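As a minimal sketch of how \(\beta_{0}\) and \(\beta_{1}\) are estimated by ordinary least squares, the closed-form solution can be computed directly. The function name `simple_ols` and the four data points are made up for illustration; they are not the 401k data used in this paper.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical match rates and participation rates (not from the sample).
# These points lie exactly on prate = 80 + 5 * mrate.
mrate = [0.1, 0.5, 1.0, 2.0]
prate = [80.5, 82.5, 85.0, 90.0]

b0, b1 = simple_ols(mrate, prate)
print(round(b0, 3), round(b1, 3))
```

Because the made-up points sit on a straight line, the function recovers the intercept and slope of that line; with real data the residuals would not be zero.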

Model Two Multiple Regression

My second regression adds the log of the total number of employees eligible to the first model. I predict the percentage of employees who participate in the 401k plan from the rate at which the employer matches employee contributions and the log of the total number of employees eligible to participate in the plan. Taking the log makes the distribution of total employees eligible closer to normal and reduces the effect of outliers.

\(prate_{i} = \beta_{0} + \beta_{1}mrate_{i} + \beta_{2}ltotelg_{i} +\epsilon_{i}\)

Model Three Multiple Polynomial Regression

My third model takes everything from the second model and adds a squared match rate term. I add the squared term because the participation rate cannot go over 100 percent, so beyond some match rate additional employer contributions should no longer increase participation.

\(prate_{i} = \beta_{0} + \beta_{1}mrate_{i} + \beta_{2}ltotelg_{i} + \beta_{3}mrate^2_i +\epsilon_{i}\)
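The reasoning behind the squared term can be made explicit by differentiating Model Three with respect to mrate while holding ltotelg fixed. The symbol \(mrate^{*}\) is notation introduced here for the turning point; it does not appear elsewhere in the paper.

\(\frac{\partial prate_{i}}{\partial mrate_{i}} = \beta_{1} + 2\beta_{3}mrate_{i}\)

\(mrate^{*} = -\frac{\beta_{1}}{2\beta_{3}}\)

If \(\beta_{1} > 0\) and \(\beta_{3} < 0\), the marginal effect of the match rate shrinks as mrate rises and reaches zero at \(mrate^{*}\). The quadratic only approximates a ceiling: past \(mrate^{*}\) it predicts falling participation rather than a flat line.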

Data

The data are cross-sectional, and the unit of analysis is a retirement (401k) plan. The data are limited to a few measures of the plan itself. Thus, I must assume that the match rate is the most important causal driver of participation, which is a strong assumption and leads to bias in my regressions. I would have liked to have more data on things like the industry of the company, average annual salaries, average level of education and other descriptive statistics of those who participated in the plans.

Key Measures

The dependent variable is prate, the percentage of employees who participate in the plan. The first independent variable is mrate, the rate at which the employer matches employee contributions to the plan. For example, a match rate of 0.21 implies the employer contributes $0.21 for every dollar an employee contributes, whereas a match rate of 1.42 implies the employer contributes $1.42 for every dollar the employee contributes. The other independent variable is ltotelg, the log of the total number of employees eligible to participate in the plan. Lastly, I want to check whether the age of the plan being offered affects the model.

Descriptive Statistics

The following table gives descriptive statistics for the key variables. The total number of employees eligible is heavily right-skewed (its mean of about 1,629 is far above its median of 330), which is why I chose to take its log for my model.

Variable      n       mean         sd    median      min        max      range
prate      1534     87.363     16.717    95.700    3.000    100.000     97.000
mrate      1534      0.732      0.780     0.460    0.010      4.910      4.900
totelg     1534   1628.535   5370.719   330.000   51.000  70429.000  70378.000
ltotelg    1534      6.135      1.290     5.799    3.932     11.162      7.231
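The effect of the log transformation can be checked against the table itself. The sketch below, using only the reported minimum, median and maximum (rounded to three decimals), confirms that ltotelg is the natural log of totelg and shows how much the log compresses the right tail. The variable names are chosen for illustration.

```python
import math

# Reported summary statistics from the descriptive table.
totelg = {"min": 51, "median": 330, "max": 70429}
ltotelg = {"min": 3.932, "median": 5.799, "max": 11.162}  # rounded

# The log is monotonic, so the min and max carry over directly.
for stat in ("min", "median", "max"):
    logged = math.log(totelg[stat])  # natural log
    print(stat, round(logged, 3), ltotelg[stat])

# How far the largest plan sits above the median, before and after the log.
ratio_raw = totelg["max"] / totelg["median"]
ratio_log = math.log(totelg["max"]) / math.log(totelg["median"])
print(round(ratio_raw, 1), round(ratio_log, 2))
```

On the raw scale the largest plan is more than two hundred times the median plan, while on the log scale it is less than twice the median, which is why the logged variable is far less sensitive to the few very large plans.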

Results

Below are the regressions mentioned above. The significant coefficients look intriguing but are biased and lack practical significance. Interpretation of each model is listed after the tables.

Table 1

Dependent variable: prate

mrate                  5.861***
                       (0.527)
Constant               83.075***
                       (0.563)

Observations           1,534
R2                     0.075
Adjusted R2            0.074
Residual Std. Error    16.085 (df = 1532)
F Statistic            123.685*** (df = 1; 1532)

Note: *p<0.1; **p<0.05; ***p<0.01. Standard errors in parentheses.

Table 2

Dependent variable: prate

                       (1)                          (2)
mrate                  5.413***                     15.760***
                       (0.516)                      (1.426)
I(mrate^2)                                          -2.961***
                                                    (0.382)
ltotelg                -2.857***                    -2.684***
                       (0.312)                      (0.307)
Constant               100.934***                   95.682***
                       (2.023)                      (2.097)

Observations           1,534                        1,534
R2                     0.123                        0.156
Adjusted R2            0.122                        0.154
Residual Std. Error    15.666 (df = 1531)           15.372 (df = 1530)
F Statistic            107.241*** (df = 2; 1531)    94.328*** (df = 3; 1530)

Note: *p<0.1; **p<0.05; ***p<0.01. Standard errors in parentheses.

Robustness Checks

One variable I wanted to make sure I was not omitting was the age of the plan. I checked this by adding dummy variables for plan ages five through ten to the simple regression, using plans that are four to ten years old (898 observations), with four-year-old plans as the omitted category. For example, the estimated coefficient on age5 is 0.792. This implies that plans that are five years old have participation rates 0.792 percentage points higher than four-year-old plans, holding the match rate constant. The pattern in the coefficients suggests the relationship between age and participation is generally positive, meaning participation tends to be higher in plans that have been around longer. However, all of these estimates are imprecise: every standard error is above 10, and none of the age coefficients is statistically different from 0. An F-test of the six dummies together gives p = 0.096, so they are jointly significant only at the 10 percent level.

Table 3

Dependent variable: prate

mrate                  6.975***
                       (0.820)
age5                   0.792
                       (10.311)
age6                   2.413
                       (10.303)
age7                   5.057
                       (10.258)
age8                   3.240
                       (10.293)
age9                   5.783
                       (10.374)
age10                  7.951
                       (10.504)
Constant               76.436***
                       (10.202)

Observations           898
R2                     0.085
Adjusted R2            0.078
Residual Std. Error    17.666 (df = 890)
F Statistic            11.827*** (df = 7; 890)

Note: *p<0.1; **p<0.05; ***p<0.01. Standard errors in parentheses.

F-test comparing the model without the age dummies to the model in Table 3

Res.Df    RSS         Df    Sum of Sq    F           Pr(>F)
896       281147.5
890       277773.6    6     3373.924     1.801702    0.0957771
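The F statistic in the comparison above can be reproduced by hand from the two residual sums of squares. This is a sketch of the standard formula for testing a set of exclusion restrictions; the variable names are chosen for illustration, and the p-value is not recomputed here because that would need an F distribution from a statistics library.

```python
# Values taken from the model-comparison output above.
rss_restricted = 281147.5    # model without the age dummies
rss_unrestricted = 277773.6  # model with the six age dummies (Table 3)
q = 6                        # number of restrictions (age5 through age10)
df_unrestricted = 890        # residual degrees of freedom in Table 3

# F = [(RSS_r - RSS_u) / q] / [RSS_u / df_u]
numerator = (rss_restricted - rss_unrestricted) / q
denominator = rss_unrestricted / df_unrestricted
f_stat = numerator / denominator
print(round(f_stat, 3))
```

The result agrees with the reported F of about 1.80, up to rounding in the printed sums of squares.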

Conclusion

Simple Regression

My first model is clearly biased because much more affects the 401k participation rate than just the employer match rate. The estimate in table one shows that when the employer match rate increases by one unit (an additional dollar of employer match per dollar of employee contribution), we would estimate a 5.861 percentage point increase in employee participation. As for practical significance, a one standard deviation change in mrate (0.78) changes the participation rate by only about 4.6 percentage points. However, as previously mentioned, this model is biased because it does not account for other variables.
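Using the mrate coefficient from table one and the standard deviations from the descriptive statistics, the effect of a one standard deviation change in mrate works out as:

\(\hat{\beta}_{1} \times sd(mrate) = 5.861 \times 0.7795 \approx 4.57\)

\(\frac{4.57}{sd(prate)} = \frac{4.57}{16.72} \approx 0.27\)

That is, a one standard deviation increase in the match rate is associated with an increase of about 4.6 percentage points in participation, or roughly a quarter of a standard deviation of prate.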

Multiple Regression

The first model in table two considers the possibility that larger companies have a harder time achieving a high participation rate in their 401k plans because of their size. To deal with the skewed distribution of the number of employees eligible for the plan, I used its log. The estimates support this idea: the coefficient on the match rate is similar to, though slightly smaller than, the one in table one, and the coefficient on the log of total employees eligible is negative. This means that with a one percent increase in the number of employees eligible for the plan, we would estimate a decrease of about 0.02857 percentage points in the participation rate.
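The interpretation of the ltotelg coefficient follows from the level-log form of the model. Dividing the coefficient by 100 is the usual approximation; the exact change for a one percent increase, and for a doubling of plan size, is:

\(\Delta prate = \hat{\beta}_{2} \times \ln(1.01) = -2.857 \times 0.00995 \approx -0.028\)

\(\Delta prate = \hat{\beta}_{2} \times \ln(2) = -2.857 \times 0.693 \approx -1.98\)

So even a plan with twice as many eligible employees is predicted to have a participation rate only about 2 percentage points lower, holding the match rate constant.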

However, this is also not a good model: the constant, which is the predicted participation rate when both mrate and ltotelg equal zero, is already above 100 percent, which is not possible.
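To see how far the linear form can drift from the feasible range, the fitted equation from column (1) of table two can be evaluated at a few points. This is a sketch only: the function name is made up, the inputs are the sample means and sample extremes from the descriptive statistics, and the combination of the largest match rate with the smallest plan need not correspond to any actual plan in the data.

```python
def predict_model_two(mrate, ltotelg):
    """Predicted prate from column (1) of Table 2."""
    return 100.934 + 5.413 * mrate - 2.857 * ltotelg


# At the sample means of mrate and ltotelg.
at_means = predict_model_two(0.7315124, 6.1353126)

# At the largest observed match rate and the smallest observed plan size.
at_extremes = predict_model_two(4.91, 3.931826)

print(round(at_means, 2), round(at_extremes, 2))
```

At the sample means the prediction is close to the mean of prate, as expected for OLS, but at the extremes the linear model predicts participation well above 100 percent.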

Multiple Polynomial Regression

The final model seems to be the best and most accurate description of participation rates in 401k plans given the data I had.

The second model in table two reflects the idea that participation is capped at 100 percent, so eventually the participation rate will stop increasing no matter how generous the employer match is. In table two the coefficient on the squared mrate term is negative, meaning the effect of the match rate shrinks as the match rate rises.

The expected change in the employee participation rate when mrate increases by one unit is \(15.760 - 2(2.961)\,mrate\).

This implies participation initially increases as the employer match rises but eventually tapers off. The match rate at which predicted participation peaks can be found by setting the expression above equal to 0 and solving for mrate. This yields a match rate of approximately 2.66, or $2.66 of employer contribution per dollar of employee contribution, holding the total number of employees eligible for the plan constant.
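The marginal effect and the turning point can be checked numerically from the column (2) estimates in table two. The sketch below uses made-up names (`marginal_effect`, `turning_point`) and evaluates the marginal effect at the minimum, median, mean and maximum of mrate from the descriptive statistics.

```python
B1 = 15.760   # coefficient on mrate, Table 2 column (2)
B3 = -2.961   # coefficient on mrate squared, Table 2 column (2)


def marginal_effect(mrate):
    """Change in predicted prate per unit change in mrate, at a given mrate."""
    return B1 + 2 * B3 * mrate


# Match rate at which the marginal effect is zero.
turning_point = -B1 / (2 * B3)
print(round(turning_point, 2))

# Marginal effect at the min, median, mean and max of mrate in the sample.
for m in (0.01, 0.46, 0.7315, 4.91):
    print(m, round(marginal_effect(m), 3))
```

The marginal effect is positive at the median and mean match rates and turns negative beyond the turning point, so for the highest match rates in the sample the quadratic predicts declining rather than flat participation.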

Final Conclusion

Across my regressions, participation in 401k plans generally increases as the match rate increases. Even though the coefficients are statistically significant, the models lack the variables needed to properly estimate employee participation in 401k plans. I am unable to conclude that the match rate is the most important causal driver of participation. I see numerous companies with low match rates and high participation rates, and also companies with high match rates and low participation rates, relative to the other companies in the data. I can better conclude that people generally respond to a higher incentive, in this case the match rate, but they do not always respond in the ways we expect.

I was unable to control for several variables whose omission could bias my coefficients. For example, I was unable to control for the average debt of the employees eligible for the plan. If employees are focusing on paying off their debt, they will most likely not participate in the 401k plan. Further research could help find better predictors of the participation rate, and a larger sample of companies would also help. I would also have liked to have the industry of the company, average annual salaries, average level of education and other descriptive statistics of those who participated in the plans.

Works Cited

Stobierski, T. (2018, March 30). 401(k) Basics: When It Was Invented and How It Works. Retrieved November 29, 2019, from https://www.northwesternmutual.com/life-and-money/your-401k-when-it-was-invented-and-why/.