Soc 712 | Assignment 3 | Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE)

Introduction

MLE is a method for estimating population parameters (for example, the mean and variance of a normal distribution) from sample data by choosing the parameter values that maximize the likelihood of the observed data. Before using MLE, we first need to determine whether the variable's probability distribution is discrete or continuous. A discrete distribution (e.g. race, class, gender) is described by a Probability Mass Function (PMF), and a continuous distribution (e.g. income, height, temperature) is described by a Probability Density Function (PDF).
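To make the definition concrete, the block below writes out the likelihood and log-likelihood for a normal sample. It assumes the observations x_1, ..., x_n are independent and identically distributed, which is an assumption added here for illustration. Maximizing the log-likelihood gives the same estimates as maximizing the likelihood, because the logarithm is an increasing function.

```latex
L(\mu, \sigma \mid x) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

\ell(\mu, \sigma \mid x) = \log L(\mu, \sigma \mid x)
  = -\frac{n}{2}\log(2\pi) - n\log\sigma
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2
```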

The following analyses use the “turnout” data from the Zelig package. Since the dependent variable, income, is continuous, a PDF (the normal density) is used. The two MLE models below, labelled the log-likelihood function and the likelihood function, both maximize a normal log-likelihood: the first models the mean of income and the second models its standard deviation. They answer different questions, but both are useful for understanding the data.

First of all, let’s import the necessary packages.

library(Zelig)

## Loading required package: survival

library(maxLik)

## Loading required package: miscTools

## 
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
## 
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:Zelig':
## 
##     stat

data(turnout)

Here is a glimpse of the dataset that will be analyzed

head(turnout)

##    race age educate income vote
## 1 white  60      14 3.3458    1
## 2 white  51      10 1.8561    0
## 3 white  24      12 0.6304    0
## 4 white  38       8 3.4183    1
## 5 white  25      12 2.7852    1
## 6 white  67      12 2.3866    1

Let’s draw a simple bar graph as an initial analysis

turnout%>%
  group_by(educate)%>%
  summarize(mean_income=mean(income))%>%
  ggplot()+
  geom_col(aes(x=educate, y=mean_income, fill=mean_income))+
  theme(legend.position="none")

Interpretation: The bar graph shows that as education of the participants in the sample data increases, their mean income increases.

Log-likelihood function

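# Generic normal log-likelihood: `x` must be a numeric data vector
# that already exists in the calling environment
logLikFun <- function(param) {
    mu <- param[1]     # mean
    sigma <- param[2]  # standard deviation
    sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}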
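As a rough illustration of how a function like this is used, the sketch below maximizes it on simulated data and compares the result with the closed-form normal MLEs. The simulated sample, the seed, and the guard against a non-positive sigma are assumptions added for this example. Base R's optim() is swapped in for maxLik() so that the sketch runs without extra packages.

```r
# Simulated data stand in for a real sample (assumption for illustration only)
set.seed(712)
x <- rnorm(500, mean = 3, sd = 2)

# Generic normal log-likelihood with a guard for invalid sigma values
logLikFun <- function(param) {
  mu <- param[1]
  sigma <- param[2]
  if (sigma <= 0) return(-1e10)  # the standard deviation must be positive
  sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# optim() minimizes by default; fnscale = -1 makes it maximize instead
fit <- optim(par = c(mu = 1, sigma = 1), fn = logLikFun,
             control = list(fnscale = -1))

# Closed-form MLEs for a normal sample (the variance divides by n, not n - 1)
mu_hat <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))

print(round(fit$par, 3))
print(round(c(mu = mu_hat, sigma = sigma_hat), 3))
```

The two printed vectors should agree closely, which is a quick sanity check that the log-likelihood is coded correctly.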

Log-likelihood function (Income and Education)

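ols.lf <- function(param) {
  beta <- param[-1]               # regression coefficients (intercept, slope)
  sigma <- param[1]               # standard deviation of the errors
  y <- as.vector(turnout$income)  # dependent variable
  x <- cbind(1, turnout$educate)  # intercept column plus education
  mu <- x %*% beta                # conditional mean of income
  sum(dnorm(y, mu, sigma, log = TRUE))  # normal log-likelihood
}
mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1))
summary(mle_ols)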

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 12 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -4691.256 
## 3  free parameters
## Estimates:
##       Estimate Std. error t value Pr(> t)    
## sigma  2.52613    0.03989  63.326 < 2e-16 ***
## beta1 -0.65207    0.20827  -3.131 0.00174 ** 
## beta2  0.37613    0.01663  22.612 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Interpretation: The above analysis (slide 4.15) maximizes the log-likelihood to estimate the effect of education on income. Here, education is the independent variable and income is the dependent variable. The three parameters in this analysis are sigma, beta1 and beta2.
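A normal log-likelihood with a linear mean is maximized by the ordinary least squares coefficients, with sigma equal to the square root of the mean squared residual. The sketch below checks this on simulated data rather than the turnout data. The simulated values and the names ols_loglik, mle_par and ll_nearby are assumptions made for this example.

```r
# Simulated data stand in for turnout (assumption for illustration only)
set.seed(1)
n <- 1000
educate <- sample(0:19, n, replace = TRUE)
income <- -0.5 + 0.4 * educate + rnorm(n, sd = 2.5)

# Same structure as ols.lf: param = (sigma, intercept, slope)
ols_loglik <- function(param) {
  sigma <- param[1]
  mu <- cbind(1, educate) %*% param[-1]
  sum(dnorm(income, mu, sigma, log = TRUE))
}

# Least squares fit; the MLE of sigma divides the residual sum of squares by n
ols <- lm(income ~ educate)
mle_par <- c(sigma = sqrt(mean(resid(ols)^2)), coef(ols))

# Log-likelihood at the least squares solution and after nudging each parameter
ll_at_ols <- ols_loglik(mle_par)
ll_nearby <- sapply(1:3, function(j) {
  shifted <- mle_par
  shifted[j] <- shifted[j] + 0.05
  ols_loglik(shifted)
})

print(ll_at_ols)
print(ll_nearby)
```

Every nudged value gives a lower log-likelihood than the least squares solution, which is why a summary from maxLik() for this kind of model should match lm() up to the divisor used for sigma.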

The estimated standard deviation of income around the fitted line (sigma) is 2.53. Beta1 is the y-intercept, which gives a predicted income of -0.65 units for voters with no education. Since a negative income is not plausible, this value is better read as the point where the fitted line crosses the axis than as a realistic income. Beta2 is the slope and shows the effect of education on income. The value of beta2 is 0.38, which means that for every one-unit increase in education, a person’s income in this dataset is expected to increase by 0.38 units. Therefore, education and income are positively associated. The estimates for both beta1 and beta2 are statistically significant at an alpha value of 0.05.
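To see what these coefficients imply, the short sketch below plugs the reported estimates into the fitted line. The three education values are chosen only for illustration and are not taken from the analysis.

```r
# Estimates copied from the summary(mle_ols) output above
beta1 <- -0.65207  # intercept
beta2 <-  0.37613  # slope for education

# Illustrative education values (chosen for this example only)
educate <- c(8, 12, 16)

# Predicted mean income: beta1 + beta2 * educate
predicted_income <- beta1 + beta2 * educate
print(data.frame(educate, predicted_income))
```

With these estimates, predicted income rises from about 2.36 units at 8 units of education to about 5.37 units at 16.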

Likelihood function (Income and Education)

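ols.lf2 <- function(param){
    mu <- param[1]                  # constant mean of income
    theta <- param[-1]              # intercept and slope of the standard deviation
    y <- as.vector(turnout$income)  # dependent variable
    x <- cbind(1, turnout$educate)  # intercept column plus education
    sigma <- x %*% theta            # standard deviation as a linear function of education
    sum(dnorm(y, mu, sigma, log = TRUE))  # normal log-likelihood
}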

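This function looks similar to the previous one, and it is also a normal log-likelihood. The independent and dependent variables are still the same. However, the roles of the parameters are swapped: the mean is now a single constant, mu, and the standard deviation is a linear function of education with coefficients theta. Here, we are not modelling how the mean of income changes, but how its standard deviation changes across the levels of education.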
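Written as statistical models, the two functions differ only in which part of the normal distribution depends on education. The notation below is a summary of the code in ols.lf and ols.lf2, with y_i standing for income and educate_i for education of respondent i.

```latex
\text{ols.lf:}\quad
  y_i \sim \mathcal{N}\!\left(\beta_1 + \beta_2\,\text{educate}_i,\ \sigma^2\right)

\text{ols.lf2:}\quad
  y_i \sim \mathcal{N}\!\left(\mu,\ (\theta_1 + \theta_2\,\text{educate}_i)^2\right)
```

Note that the second model only makes sense where theta_1 + theta_2 educate_i is positive, since a standard deviation cannot be negative.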

mle_ols2 <- maxLik(logLik = ols.lf2, start = c(mu = 1, theta1 = 1, theta2 = 1))
summary(mle_ols2)

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 9 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -4861.964 
## 3  free parameters
## Estimates:
##        Estimate Std. error t value Pr(> t)    
## mu     3.516764   0.070320   50.01  <2e-16 ***
## theta1 1.461011   0.106745   13.69  <2e-16 ***
## theta2 0.109081   0.009185   11.88  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Interpretation: This analysis (slide 4.19) shows the effect education has on the variability of income. The three parameters in this analysis are mu, theta1 and theta2. Here, the standard deviation is split into two parts: an intercept (theta1) and a slope (theta2). The mean income (mu) is 3.52 units. Theta1 is the intercept of the standard-deviation equation: the estimated standard deviation of income for voters with no education is 1.46. Theta2 represents the effect education has on the standard deviation of income. The value of theta2 is 0.11, which means that a one-unit increase in education is associated with a 0.11-unit increase in the standard deviation of income. The estimates for both theta1 and theta2 are statistically significant at an alpha value of 0.05.
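The sketch below turns the reported theta estimates into implied standard deviations at a few education values, and compares the log-likelihoods of the two education models. The education values are chosen only for illustration. Comparing the log-likelihoods directly is a rough guide, justified here by both models having three free parameters.

```r
# Estimates copied from the summary(mle_ols2) output above
theta1 <- 1.461011  # intercept of the standard deviation
theta2 <- 0.109081  # change in standard deviation per unit of education

# Illustrative education values (chosen for this example only)
educate <- c(8, 12, 16)
implied_sd <- theta1 + theta2 * educate
print(data.frame(educate, implied_sd))

# Log-likelihoods copied from the two summaries above (3 parameters each)
loglik_mean_model <- -4691.256  # education shifts the mean of income
loglik_sd_model   <- -4861.964  # education shifts the standard deviation
loglik_gap <- loglik_mean_model - loglik_sd_model
print(loglik_gap)
```

The implied standard deviation rises from about 2.33 at 8 units of education to about 3.21 at 16. The mean model has the higher log-likelihood by about 170.7, which suggests that education explains the level of income better than it explains its spread.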

Adding a Second Variable - AGE

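Adding a second independent variable can improve the model if that variable helps explain income. More independent variables do not automatically make the estimates more precise, so the added variable has to be judged by its own estimate and significance. If another variable, age, is added to these two models, each model gains one more parameter: another beta in the first and another theta in the second.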

Prediction

My prediction is that age and income will be positively correlated. I also think the education coefficient will decrease, because age would now explain a good portion of income. However, age and education are interrelated; therefore, the significance of adding the variable age to the previous equation is uncertain.

Log-likelihood function (Income, Education and Age)

ols.lf <- function(param) {
  beta <- param[-1] #Regression Coefficients
  sigma <- param[1] #Standard Deviation
  y <- as.vector(turnout$income) #DV
  x <- cbind(1, turnout$educate, turnout$age) #IV
  mu <- x%*%beta #multiply matrices
  sum(dnorm(y, mu, sigma, log = TRUE)) #normal distribution(vector of observations, mean, sd)
  }    
mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1, beta3=1))
summary(mle_ols)

## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 16 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -4690.815 
## 4  free parameters
## Estimates:
##        Estimate Std. error t value Pr(> t)    
## sigma  2.525576   0.039919  63.268  <2e-16 ***
## beta1 -0.446047   0.300583  -1.484   0.138    
## beta2  0.371011   0.017493  21.209  <2e-16 ***
## beta3 -0.003184   0.003373  -0.944   0.345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Here again, sigma is the standard deviation of income, beta1 is the intercept, beta2 is the slope for education and beta3 is the slope for age.

Interpretation: The results indicate that the age coefficient is negative, meaning that my prediction about the direction of the relationship was wrong. The point estimate suggests that, holding education constant, income decreases slightly as age increases (by about 0.003 units per unit of age). However, this estimate is not statistically significant, with a p-value of 0.345, so we cannot conclude that age has any effect on income in this model.

The education coefficient decreased only slightly, from 0.376 to 0.371, meaning that every additional unit of education is associated with a 0.37-unit increase in income. The estimated standard deviation remains the same as before, at 2.53. Education is still statistically significant in predicting income at an alpha value of 0.05.
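Because the education-only model is nested in the education-plus-age model, the two fits can also be compared with a likelihood ratio test. The sketch below uses the log-likelihoods reported in the two summaries. The test itself is an addition for illustration and was not part of the original analysis.

```r
# Log-likelihoods copied from the two summary(mle_ols) outputs above
loglik_educ     <- -4691.256  # income on education (3 free parameters)
loglik_educ_age <- -4690.815  # income on education and age (4 free parameters)

# Likelihood ratio statistic: twice the gain in log-likelihood
lr_stat <- 2 * (loglik_educ_age - loglik_educ)

# One extra parameter (beta3), so the reference distribution is chi-square with 1 df
df <- 4 - 3
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)

print(c(lr_stat = lr_stat, p_value = p_value))
```

The statistic is 0.882, well below the 5% critical value of about 3.84 for one degree of freedom. This agrees with the non-significant t-test on beta3: adding age does not measurably improve the fit of the mean model.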

Measuring Income Inequality Across Age and Education

Likelihood function (Income, Age, Education)

ols.lf2 <- function(param) {
  mu <- param[1]
  theta <- param[-1]
  y <- as.vector(turnout$income) #DV
  x <- cbind(1, turnout$educate, turnout$age) #IV
  sigma <- x%*%theta   #multiply matrices
  sum(dnorm(y, mu, sigma, log = TRUE)) #normal distribution(vector of observations, mean, sd)
  }
mle_ols2 <- maxLik(logLik = ols.lf2, start = c(mu = 1, theta1 = 1, theta2 = 1, theta3 = 1), method="BFGS")
summary(mle_ols2)

## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 150 iterations
## Return code 0: successful convergence 
## Log-Likelihood: -4843.15 
## 4  free parameters
## Estimates:
##        Estimate Std. error t value  Pr(> t)    
## mu     3.555011   0.069193  51.378  < 2e-16 ***
## theta1 0.362114   0.204550   1.770   0.0767 .  
## theta2 0.133349   0.010756  12.398  < 2e-16 ***
## theta3 0.017507   0.002852   6.139 8.32e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Interpretation: The results show the effects of education and age on the variability of income. The mean income (mu) is 3.56 units, a slight increase from before. Theta1 is the intercept of the standard-deviation equation: the estimated standard deviation of income is 0.36 when education and age are both zero, and this estimate is not statistically significant at an alpha value of 0.05 (p = 0.0767). Theta2 represents the effect education has on the standard deviation of income. The value of theta2 increased from 0.11 to 0.13, which means that in the current model a one-unit increase in education is associated with a 0.13-unit increase in the standard deviation of income, holding age constant. Theta3 is 0.018, so each additional unit of age is associated with a 0.018-unit increase in the standard deviation of income, holding education constant. The estimates for both theta2 and theta3 are statistically significant at an alpha value of 0.05. Therefore, we can conclude that higher age and education are associated with greater dispersion, or inequality, of income among the participants in our sample data.
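To make the size of these effects easier to read, the sketch below computes the implied standard deviation of income for a few combinations of education and age, using the reported theta estimates. The four profiles are hypothetical values chosen for illustration.

```r
# Estimates copied from the summary(mle_ols2) output above
theta <- c(intercept = 0.362114, educate = 0.133349, age = 0.017507)

# Hypothetical profiles (values chosen for this example only)
profiles <- expand.grid(educate = c(8, 16), age = c(25, 65))

# Same linear form as sigma <- x %*% theta in ols.lf2
x <- cbind(1, profiles$educate, profiles$age)
profiles$implied_sd <- as.vector(x %*% theta)

print(profiles)
```

Across these profiles the implied standard deviation ranges from about 1.87 (education 8, age 25) to about 3.63 (education 16, age 65), so the spread of income roughly doubles between the lowest and highest profile.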