Sungji Peter Shin
2019-02-24
While the probability function looks for data given parameters, likelihood function estimates the parameters given data. In this assignment, I used ‘turnout’ dataset and Maximum Likelihood Estimation (MLE) method to explore relationship between, initially, income and educate variables and lastly, with second independent varaible, age, added to the initial one.
Below lines of code treat standard deviation as it is constant and find maximum likelihood estimates of y-intercept for mean (beta1) and slope for the mean (beta2). Beta1 or -0.65 represents the y-intecept and means that individuals with no education (x=0) have mean income of -0.65. Beta2 or 0.38 represents the slope and means that increasing value of x (education) by 1 (year, in this case) will result an increased mean income by 0.38. Also, a positive relationship between educate and income is statistically significant, in respect to mean.
library(Zelig)
data(turnout)
ols.lf <- function(param) {
beta <- param[-1]
sigma <- param[1]
y <- as.vector(turnout$income)
x <- cbind(1, turnout$educate)
mu <- x%*%beta
sum(dnorm(y, mu, sigma, log = TRUE))}
library(maxLik)
mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1))
summary(mle_ols)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 12 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -4691.256
3 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
sigma 2.52613 0.03989 63.326 < 2e-16 ***
beta1 -0.65207 0.20827 -3.131 0.00174 **
beta2 0.37613 0.01663 22.612 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
Below lines of code treat mean as it is constant and find maximum likelihood estimates of y-intercept for standard deviation (theta1) and slope for the standard deviation (theta2). Theta1 or 1.46 represents the y-intercept and means that individuals with no education (x=0) have standard deviation of income of 1.46. Theta2 or 0.11 represents the slope and means that increasing value of x by 1 will result an increased standard deviation of income by 0.11. Statistically significant positive relationship between independent and dependent variables was also observed in respect to the standard deviation.
ols.lf2 <- function(param) {
mu <- param[1]
theta <- param[-1]
y <- as.vector(turnout$income)
x <- cbind(1, turnout$educate)
sigma <- x%*%theta
sum(dnorm(y, mu, sigma, log = TRUE))}
library(maxLik)
mle_ols2 <- maxLik(logLik = ols.lf2, start = c(mu = 1, theta1 = 1, theta2 = 1))
summary(mle_ols2)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 9 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -4861.964
3 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
mu 3.516764 0.070320 50.01 <2e-16 ***
theta1 1.461011 0.106745 13.69 <2e-16 ***
theta2 0.109081 0.009185 11.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
Above, we have observed that the more educated individuals are, the higher income they will earn. We also have noticed that the more education they have, the more variation in their income.
If we add a second independent variable, age, the outcomes may lead us to different conclusions. I hypothesize that age will not have much impact on mean income but have great impact on income inequality or variation in income. The reason behind my argument is that individuals with higher education will have more opportunities to get offered a better job or to get promoted as they get older; while, it will less likely to happen to those with lower or no education as they get older.
#adding a second independent variable, AGE
ols.lf3 <- function(param) {
beta <- param[-1]
sigma <- param[1]
y <- as.vector(turnout$income)
x <- cbind(1, turnout$educate, turnout$age)
mu <- x%*%beta
sum(dnorm(y, mu, sigma, log = TRUE))}
library(maxLik)
mle_ols3 <- maxLik(logLik = ols.lf3, start = c(sigma = 1, beta1 = 1, beta2 = 1, beta3 = 1))
summary(mle_ols3)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 16 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -4690.815
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
sigma 2.525576 0.039919 63.268 <2e-16 ***
beta1 -0.446047 0.300583 -1.484 0.138
beta2 0.371011 0.017493 21.209 <2e-16 ***
beta3 -0.003184 0.003373 -0.944 0.345
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
The outcome shows the y-intercept of -0.446, beta2 of 0.37, and beta3 of -0.003. The numbers represent the mean income of -0.446 when educate and age are 0; the mean income increases by 0.37 when educate is increased by 1; and the mean income decreases by 0.003 when age is increased by 1. It seems like my hypothesis is right, but the relationship between age and income is not statistically significant. The positive relationship betwwen educate and income is still statistically significant.
#adding a second independent varaible, AGE
ols.lf4 <- function(param) {
mu <- param[1]
theta <- param[-1]
y <- as.vector(turnout$income)
x <- cbind(1, turnout$educate, turnout$age)
sigma <- x%*%theta
sum(dnorm(y, mu, sigma, log = TRUE))}
library(maxLik)
mle_ols4 <- maxLik(logLik = ols.lf4, start = c(mu = 1, theta1 = 1, theta2 = 1, theta3 = 1), method="bfgs")
summary(mle_ols4)
--------------------------------------------
Maximum Likelihood estimation
BFGS maximization, 150 iterations
Return code 0: successful convergence
Log-Likelihood: -4843.15
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
mu 3.555011 0.069193 51.378 < 2e-16 ***
theta1 0.362114 0.204550 1.770 0.0767 .
theta2 0.133349 0.010756 12.398 < 2e-16 ***
theta3 0.017507 0.002852 6.139 8.32e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
The outcome shows the y-intercept of 0.36, theta2 of 0.13, and theta3 of 0.018. The numbers represent the standard deviation of income is 0.36 when two independent variables are 0; its standard deviation increases by 0.13 when educate is increased by 1; and its standard deviation increases by 0.18 when age is increased by 1. Both positive relationships are statistically significant. However, the age does not seem to have a great impact on income inequality as I expected.
library(ggplot2)
ggplot(turnout)+
geom_point(aes(x = age, y = income)) + geom_smooth(aes(x = age, y = income)) +
labs(title = "Income Variation by Age")
ggplot(turnout)+
geom_point(aes(x = educate, y = income)) + geom_smooth(aes(x = educate, y = income)) +
labs(title = "Income Variation by Education")