Introduction
As of 2017, the national attainment rate for post-secondary education, including quality credentials, was 47.6% [1]. Year over year, Americans are gradually earning degrees beyond high school or credentials that give them skills to advance in the workforce. Prior data suggest a strong correlation between educational attainment and income: as education increases, people specialize in a field and tend to earn a higher income.
I am interested in assessing the complex relationship between education and income by looking at the “turnout” dataset, imported from the Zelig package. The Zelig package provides a unified interface to a range of statistical models. To assess the relationship between the independent variable (education) and the dependent variable (income), I will conduct maximum likelihood estimations and compare the results from Slide 4.15 and Slide 4.19. Slide 4.15 estimates education’s impact on the mean of income (mu). Slide 4.19 estimates education’s impact on the standard deviation of income (sigma), which serves as a measure of income inequality.
Maximum likelihood estimation determines parameter values (such as the mean, slope, y-intercept, or variance) from a data sample. It does so by choosing the parameter values that maximize the probability of obtaining the observed data.
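To make the idea concrete, here is a minimal sketch of maximum likelihood estimation in base R. It uses simulated data rather than the turnout data, and the names y_sim, norm_loglik, and fit_sim are illustrative assumptions, not part of the original analysis. For a normal sample, the maximum likelihood estimate of the mean should match the sample mean.

```r
# Minimal sketch: maximum likelihood for a normal sample using base R only.
# The data are simulated (hypothetical), not the turnout data.
set.seed(1)
y_sim <- rnorm(500, mean = 3, sd = 2)

# Log-likelihood of the sample for a candidate (mu, log_sigma).
# Searching over log(sigma) keeps sigma positive during the optimization.
norm_loglik <- function(param) {
  mu <- param[1]
  sigma <- exp(param[2])
  sum(dnorm(y_sim, mean = mu, sd = sigma, log = TRUE))
}

# optim() minimizes by default; fnscale = -1 turns it into a maximizer.
fit_sim <- optim(c(0, 0), norm_loglik, method = "BFGS",
                 control = list(fnscale = -1))

mu_hat <- fit_sim$par[1]
sigma_hat <- exp(fit_sim$par[2])

# Compare the numerical estimates with the closed-form answers.
print(c(mu_hat = mu_hat, sample_mean = mean(y_sim)))
print(c(sigma_hat = sigma_hat,
        mle_sd = sqrt(mean((y_sim - mean(y_sim))^2))))
```

The maxLik calls later in this report follow the same pattern, with the log-likelihood function written by hand and the optimizer searching for the parameter values that maximize it.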
library(Zelig)
## Loading required package: survival
library(maxLik)
## Loading required package: miscTools
##
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
##
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
The “turnout” dataset is imported from the Zelig package. The variables I will be looking at are education and income. Education serves as the independent variable, while income is the dependent variable. When finding a probability mass function (PMF), income is treated as a discrete variable; when finding a probability density function (PDF), income is treated as a continuous variable. The models below use the normal density (dnorm), so income is treated as continuous.
data("turnout")
head(turnout)
##    race age educate income vote
## 1 white  60      14 3.3458    1
## 2 white  51      10 1.8561    0
## 3 white  24      12 0.6304    0
## 4 white  38       8 3.4183    1
## 5 white  25      12 2.7852    1
## 6 white  67      12 2.3866    1
Impact of Education on Mean Income (Slide 4.15)
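ols.lf <- function(param) {
  beta <- param[-1]                 # all but the first parameter: intercept and slope of the mean
  sigma <- param[1]                 # first parameter: standard deviation of income
  y <- as.vector(turnout$income)
  x <- cbind(1, turnout$educate)    # design matrix with an intercept column
  mu <- x %*% beta                  # mean income as a linear function of education
  sum(dnorm(y, mu, sigma, log = TRUE))  # normal log-likelihood of the sample
}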
param <- c(1,2,3,4,5)
param[-1]
## [1] 2 3 4 5
mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1), method = "nm")
summary(mle_ols)
## --------------------------------------------
## Maximum Likelihood estimation
## Nelder-Mead maximization, 178 iterations
## Return code 0: successful convergence
## Log-Likelihood: -4691.257
## 3 free parameters
## Estimates:
##       Estimate Std. error t value  Pr(> t)
## sigma  2.52584    0.03992  63.274  < 2e-16 ***
## beta1 -0.65328    0.20518  -3.184  0.00145 **
## beta2  0.37636    0.01641  22.937  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
Slide 4.15 displays the relationship between education and average income. There are three parameters in this first log-likelihood function (sigma, beta1, beta2). Sigma, the standard deviation of income around its predicted mean, is 2.526. Beta1 is the y-intercept of the mean, -0.653: when education is equal to 0, the predicted mean income is -0.653. Beta2 is the slope of the mean, 0.376: when education increases by one unit, mean income increases by 0.376 units. For example, at educate = 12 the predicted mean income is -0.653 + 0.376 × 12 ≈ 3.86. These results suggest a positive relationship between education and average income; as education increases, so does income.
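For reference, the quantity that ols.lf computes can be written out explicitly. The notation below (n for the number of observations, educate_i for the education value of observation i) is my own shorthand for what the code does, not notation taken from the slides.

```latex
\begin{align*}
y_i &\sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \beta_1 + \beta_2\,\mathrm{educate}_i \\
\ell(\sigma, \beta_1, \beta_2)
  &= -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_1 - \beta_2\,\mathrm{educate}_i\right)^2
\end{align*}
```

For a fixed sigma, maximizing this expression over beta1 and beta2 is the same as minimizing the sum of squared residuals, which is why the function is named after ordinary least squares.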
Impact of Education on Standard Deviation of Income (Slide 4.19)
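ols.lf2 <- function(param) {
  mu <- param[1]                    # first parameter: constant mean of income
  theta <- param[-1]                # remaining parameters: intercept and slope of sigma
  y <- as.vector(turnout$income)
  x <- cbind(1, turnout$educate)    # design matrix with an intercept column
  sigma <- x %*% theta              # standard deviation as a linear function of education
  sum(dnorm(y, mu, sigma, log = TRUE))  # normal log-likelihood of the sample
}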
mle_ols2 <- maxLik(logLik = ols.lf2, start = c(mu = 1, theta1 = 1, theta2 = 1), method = "nm")
summary(mle_ols2)
## --------------------------------------------
## Maximum Likelihood estimation
## Nelder-Mead maximization, 154 iterations
## Return code 0: successful convergence
## Log-Likelihood: -4861.964
## 3 free parameters
## Estimates:
##        Estimate Std. error t value Pr(> t)
## mu     3.516252   0.070582   49.82  <2e-16 ***
## theta1 1.462007   0.107430   13.61  <2e-16 ***
## theta2 0.109056   0.009234   11.81  <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
Slide 4.19 displays the relationship between education and income inequality. There are three population parameters in this likelihood estimation (mu, theta1, and theta2). Mu, the mean income, is 3.516. Theta1 is now the y-intercept of the standard deviation, 1.462: when education is equal to 0, the standard deviation of income across individuals is 1.462. Theta2 is the slope, 0.109: when education increases by one unit, the standard deviation of income increases by 0.109. For example, at educate = 12 the predicted standard deviation is 1.462 + 0.109 × 12 ≈ 2.77. These results suggest a positive relationship between education and income inequality; as education increases, income varies more across individuals.
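Written out, the model behind ols.lf2 keeps the mean constant and lets the standard deviation depend on education. As before, the notation is my own shorthand for the code, not taken from the slides.

```latex
\begin{align*}
y_i &\sim \mathcal{N}(\mu, \sigma_i^2), \qquad \sigma_i = \theta_1 + \theta_2\,\mathrm{educate}_i \\
\ell(\mu, \theta_1, \theta_2)
  &= -\frac{n}{2}\log(2\pi)
     - \sum_{i=1}^{n}\log\sigma_i
     - \sum_{i=1}^{n}\frac{(y_i - \mu)^2}{2\sigma_i^2}
\end{align*}
```

This log-likelihood is only defined when every sigma_i is positive, so parameter values that make theta1 + theta2 × educate_i zero or negative for any observation are not valid.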
Age as Second Independent Variable (Mean Income)
This function will assess the relationship between age, education, and income.
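ols.lf3 <- function(param) {
  beta <- param[-1]                 # intercept, education slope, and age slope of the mean
  sigma <- param[1]                 # first parameter: standard deviation of income
  y <- as.vector(turnout$income)
  x <- cbind(1, turnout$educate, turnout$age)  # intercept, education, age
  mu <- x %*% beta                  # mean income as a linear function of education and age
  sum(dnorm(y, mu, sigma, log = TRUE))  # normal log-likelihood of the sample
}
mle_ols3 <- maxLik(logLik = ols.lf3, start = c(sigma = 1, beta1 = 1, beta2 = 1, beta3 = 1), method = "nm")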
summary(mle_ols3)
## --------------------------------------------
## Maximum Likelihood estimation
## Nelder-Mead maximization, 123 iterations
## Return code 0: successful convergence
## Log-Likelihood: -4702.483
## 4 free parameters
## Estimates:
##        Estimate Std. error t value  Pr(> t)
## sigma  2.676915   0.046455  57.624  < 2e-16 ***
## beta1  0.561730   0.322201   1.743 0.081261 .
## beta2  0.327644   0.018671  17.548  < 2e-16 ***
## beta3 -0.012936   0.003602  -3.591 0.000329 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
After adding a second independent variable (age) to the first statistical model, I can further understand the relationship between income and education. I predict that age will have a positive correlation with income, much like education. As individuals get older, they may decide to go back to college to earn a degree or certification that elevates their position in their job field. Young mothers may return to college after their children grow up. With a return to school at a later age, income will most likely increase as individuals become more skilled and experienced.
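In the model above, there are four parameters (sigma, beta1, beta2, and beta3). Sigma, the standard deviation, is 2.677. The y-intercept (beta1) is 0.562: when education and age are both equal to zero, the mean income is 0.562, although this estimate is significant only at the 10% level (p = 0.081). The slope for education (beta2) is 0.328: for every one-unit increase in education, mean income increases by 0.328, holding age constant. However, beta3 is a negative coefficient (-0.013), which suggests my original prediction is incorrect. Holding education constant, a one-unit increase in age is associated with a decrease in mean income of 0.013. One caution applies: the log-likelihood of this model (-4702.483) is lower than that of the model without age (-4691.257). Because the model without age is the special case beta3 = 0, the true maximum cannot be lower, so the Nelder-Mead search appears to have stopped before reaching the maximum, and these estimates should be read as approximate.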
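A fit of this kind can be cross-checked against lm(), because for a normal linear model the least-squares fit is also the maximum likelihood fit. The sketch below uses simulated data and base R's optim() in place of maxLik; the names educ_sim, age_sim, inc_sim, and loglik_sim are illustrative assumptions, and the simulated coefficients are arbitrary.

```r
# Sketch on simulated (hypothetical) data: compare a hand-written normal
# log-likelihood fit with lm(), which gives the exact maximum.
set.seed(42)
n_obs <- 400
educ_sim <- sample(8:18, n_obs, replace = TRUE)
age_sim  <- sample(20:80, n_obs, replace = TRUE)
inc_sim  <- 0.5 + 0.3 * educ_sim - 0.01 * age_sim + rnorm(n_obs, sd = 2)

# Reference fits: logLik() on an lm object returns the maximized log-likelihood.
fit_small <- lm(inc_sim ~ educ_sim)
fit_big   <- lm(inc_sim ~ educ_sim + age_sim)
ll_small  <- as.numeric(logLik(fit_small))
ll_big    <- as.numeric(logLik(fit_big))

# Hand-written log-likelihood with the same layout as ols.lf3:
# sigma first, then intercept and slopes.
x_sim <- cbind(1, educ_sim, age_sim)
loglik_sim <- function(param) {
  sigma <- param[1]
  if (sigma <= 0) return(-1e10)  # large penalty for an invalid sigma
  mu <- x_sim %*% param[-1]
  sum(dnorm(inc_sim, mu, sigma, log = TRUE))
}

# Nelder-Mead from all-ones starting values, as in the maxLik calls above.
fit_nm <- optim(c(1, 1, 1, 1), loglik_sim, method = "Nelder-Mead",
                control = list(fnscale = -1))
ll_nm <- fit_nm$value

# The Nelder-Mead value can never exceed the lm() value, and the model with
# age can never have a lower maximum than the model without it.
print(c(nelder_mead = ll_nm, lm_with_age = ll_big, lm_without_age = ll_small))
```

If the Nelder-Mead log-likelihood falls noticeably short of the lm() value, the search has not converged, and better starting values (for example the lm() coefficients) or a larger iteration limit would be worth trying.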
Age as Second Independent Variable (Standard Deviation of Income)
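ols.lf4 <- function(param) {
  mu <- param[1]                    # first parameter: constant mean of income
  theta <- param[-1]                # intercept, education slope, and age slope of sigma
  y <- as.vector(turnout$income)
  x <- cbind(1, turnout$educate, turnout$age)  # intercept, education, age
  sigma <- x %*% theta              # can become negative for some trial values of theta
  sum(dnorm(y, mu, sigma, log = TRUE))  # returns NaN whenever any sigma is negative
}
mle_ols4 <- maxLik(logLik = ols.lf4, start = c(mu = 1, theta1 = 1, theta2 = 1, theta3 = 1), method = "nm")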
## Warning in dnorm(y, mu, sigma, log = TRUE): NaNs produced
summary(mle_ols4)
## --------------------------------------------
## Maximum Likelihood estimation
## Nelder-Mead maximization, 269 iterations
## Return code 0: successful convergence
## Log-Likelihood: -4866.043
## 4 free parameters
## Estimates:
##         Estimate Std. error t value  Pr(> t)
## mu      3.221451   0.069045  46.657  < 2e-16 ***
## theta1 -0.722354   0.123845  -5.833 5.45e-09 ***
## theta2  0.187446   0.007655  24.486  < 2e-16 ***
## theta3  0.029008   0.002344  12.377  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
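There are four parameters in the likelihood function above (mu, theta1, theta2, and theta3). Mu, the mean income, is 3.221. The y-intercept (theta1) is -0.722. Taken literally, this means the standard deviation of income would be -0.722 when age and education are both equal to zero, which is not a meaningful value because a standard deviation cannot be negative; it is an extrapolation far outside the range of the data. The same issue explains the NaN warnings: whenever the optimizer tries parameter values that make sigma negative for some observations, dnorm returns NaN. The slope for education (theta2) is 0.187, which means that as education increases by one unit, the standard deviation of income increases by 0.187, holding age constant. Theta3, the slope for age, is 0.029: as age increases by one unit, the standard deviation of income increases by 0.029, holding education constant. The data suggest that age and education both have a positive relationship with income inequality, and all four estimates are statistically significant (p < 0.001). Note, however, that the log-likelihood (-4866.043) is lower than that of the education-only model from Slide 4.19 (-4861.964). Since that model is the special case theta3 = 0, the Nelder-Mead search appears not to have reached the maximum, so these estimates should be read as approximate.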
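One way to avoid the NaN warnings is to reject parameter values that imply a non-positive standard deviation before calling dnorm. The sketch below shows this on simulated data with base R's optim(); the names x_het, y_het, and het_loglik, and the penalty value -1e10, are illustrative assumptions rather than part of the original analysis.

```r
# Sketch on simulated (hypothetical) data: model the standard deviation as a
# linear function of a predictor and reject parameters that imply sigma <= 0.
set.seed(7)
n_het <- 500
x_het <- runif(n_het, 1, 10)
y_het <- rnorm(n_het, mean = 3, sd = 0.5 + 0.2 * x_het)

het_loglik <- function(param) {
  mu <- param[1]
  sigma <- param[2] + param[3] * x_het
  # dnorm() returns NaN (with a warning) for a negative sd, so treat such
  # parameter values as infeasible and return a large negative penalty.
  if (any(sigma <= 0)) return(-1e10)
  sum(dnorm(y_het, mu, sigma, log = TRUE))
}

# Nelder-Mead search from all-ones starting values; fnscale = -1 maximizes.
fit_het <- optim(c(1, 1, 1), het_loglik, method = "Nelder-Mead",
                 control = list(fnscale = -1, maxit = 2000))

# Estimates in the order mu, intercept of sigma, slope of sigma.
print(fit_het$par)
```

An alternative design is to model log(sigma) as the linear function and exponentiate, which keeps sigma positive for every parameter value but changes the interpretation of the slope from additive to multiplicative.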
References