# W271.4 - Assignment #2
# Completed by Jared Maslin, Sasanka Gandavarapu, and Max Yan
# Load the dataset provided, "401k_w271.RData", and look at the description
load("~/Downloads/data/401k_w271.Rdata")
desc
## variable label
## 1 prate participation rate, percent
## 2 mrate 401k plan match rate
## 3 totpart total 401k participants
## 4 totelg total eligible for 401k plan
## 5 age age of 401k plan
## 6 totemp total number of firm employees
## 7 sole = 1 if 401k is firm's sole plan
## 8 ltotemp log of totemp
# Let's load some basic libraries
library(car)
library(sandwich)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# (1) Your dependent variable will be "prate", representing the fraction of a company's employees participating in its 401k plan. Because this variable is bounded between 0 and 1, a linear model without any transformations may not be the most ideal way to analyze the data, but we can still learn a lot from it. Examine the prate variable and comment on the shape of its distribution.
# Examine "prate" with its distribution and remark on its appearance
# Start with a summary
summary(data$prate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 78.10 95.70 87.56 100.00 200.00
# Look at the quantile values at several intervals
quantile(data$prate, probs = c(0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99), na.rm = TRUE)
## 1% 5% 10% 25% 50% 75% 90% 95% 99%
## 31.763 53.995 62.760 78.100 95.700 100.000 100.000 100.000 100.000
# Plot a histogram for "prate"
hist(data$prate, main='Histogram of 401k Participation Rates', xlab='Participation Rate', ylab='Frequency', breaks = 20)

# Notice that the resulting plot shows a large portion of the observations where prate = 100.0. Also, there are a few observations with participation rates greater than 100%, which sounds inappropriate as an entity should not generally see an amount of participant greater than the amount potential participants.
# Let's try removing values of prate greater than 100 to see how it affects the distribution
summary(data$prate[data$prate <= 100])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 78.05 95.70 87.35 100.00 100.00
quantile(data$prate[data$prate <= 100], probs = c(0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99), na.rm = TRUE)
## 1% 5% 10% 25% 50% 75% 90% 95% 99%
## 31.73 53.95 62.70 78.05 95.70 100.00 100.00 100.00 100.00
hist(data$prate[data$prate <= 100], main='Histogram of 401k Participation Rates', xlab='Participation Rate', ylab='Frequency', breaks = 20)

# (2) Your independent variable will be mrate, the rate at which a company matches employee 401k contributions. Examine this variable and comment on the shape of its distribution.
# Examine the variable "mrate" in the same manner as was used above
summary(data$mrate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.3000 0.4600 0.7315 0.8300 4.9100
quantile(data$mrate, probs = c(0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95,0.99), na.rm = TRUE)
## 1% 5% 10% 25% 50% 75% 90% 95% 99%
## 0.0300 0.1100 0.1600 0.3000 0.4600 0.8300 1.6570 2.3635 4.1267
hist(data$mrate, main='Histogram of 401k Match Rates', xlab='Match Rate', ylab='Frequency', breaks = 20)

# (3) Generate a scatterplot of prate against mrate. Then estimate the linear regression of prate on mrate. What slope coefficient did you get?
# Since the distribution of "prate" seems more appropriate when removing values over 100%, let's define a new data subset for tests going forward
dataClean = data[data$prate <= 100, ]
# Now, let's generate the scatterplot and examine the result
scatterplot(dataClean$mrate, dataClean$prate)

modelQ3 = lm(prate ~ mrate, data = dataClean)
summary(modelQ3)
##
## Call:
## lm(formula = prate ~ mrate, data = dataClean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.289 -8.200 5.186 12.723 16.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.0618 0.5641 147.24 <2e-16 ***
## mrate 5.8623 0.5275 11.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.09 on 1529 degrees of freedom
## Multiple R-squared: 0.07475, Adjusted R-squared: 0.07414
## F-statistic: 123.5 on 1 and 1529 DF, p-value: < 2.2e-16
# From the summary function, we see an estimated coefficient on "mrate" of 5.8623, which is a whole number representation of the plan's match rate
# (4) Is the assumption of zero-conditional mean realistic? Explain your evidence. What are the implications for your OLS coefficients?
# We can first investigate the fitted values of our model from question 3 against the corresponding residuals for insight...
plot(modelQ3$fitted, modelQ3$resid)

# The plot shows a negative trend, with the residuals decreasing as the fitted values increase.
# This suggests to us that the assumption of zero-conditional mean may not be realistic.
# One potential implication of the zero-conditional mean assumption being unrealistic is that the p-value returned from our model could become unrealiable.
# (5) Is the assumption of homoskedasticity realistic? Provide at least two pieces of evidence to support your conclusion. What are the implications for your OLS analysis?
# First, our "fitted vs. residuals" plot from question 4 shows that the residuals decrease as the fitted values increase. This could cause our error terms to be off and our confidence intervals to be too wide.
# Second, ***could we reference the shape of the original scatterplot and the exponential-like appearance?***
# Potential implications here are that our error terms may be too large and our confidence intervals may be too wide.
# (6) Is the assumption of normal errors realistic? Provide at least two pieces of evidence to support your conclusion. What are the implications for your OLS analysis?
# First, looking at our plot of the residuals alone, we notice a clearly negative skew, pointing to a potential lack of error normality.
hist(modelQ3$resid)

# Second, ***need an additional observation here***
# ***Any implications to note?***
# (7) Based on the above considerations, what is the standard error of your slope coefficient?
# Since we have not been able to argue for our OLS assumptions being realistic, our main course of action would be to consider the robust standard error of our model from question 3:
coeftest(modelQ3, vcov=vcovHC)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.06179 0.61310 135.479 < 2.2e-16 ***
## mrate 5.86227 0.47015 12.469 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# From this, we observe a standard error for our slope coefficient of 0.47015
# (8) Is the effect you find statistically significant, and is it practically significant?
# Our model from question 3 (linear regression of prate on mrate) gave us a p-value of <2e-16 ***, which suggests a statistically significant result.
# For practical significance, our linear regression gave us an estimated coefficient on mrate of 5.8623.
# This suggests to us that, for our model, every 1 unit increase in a plan's match rate will correspond to an increase of 5.8623 units in the plan participation rate.
# Since we're talking about an additional 5% to 6% increase in the plan's participation rate (and that a plan can correspond to thousands of employees, in many cases) per unit increase in the match rate, I would argue that the result is practically significant, as well.