Linear Model

Loading the required packages

library(maxLik)
library(Zelig)
library(ggplot2)
library(coefplot)
library(ggthemes)
library(dplyr)

# Opening the "turnout" data set from the 'Zelig' package
library(Zelig)
data(turnout)
head(turnout)
##    race age educate income vote
## 1 white  60      14 3.3458    1
## 2 white  51      10 1.8561    0
## 3 white  24      12 0.6304    0
## 4 white  38       8 3.4183    1
## 5 white  25      12 2.7852    1
## 6 white  67      12 2.3866    1

Running a preliminary analysis of the “turnout” data set to explore and understand the data

Plotting a graph that shows the relationship between education and income

library(ggplot2)
ggplot(turnout, aes(x=educate, y=income)) + geom_point(color="orange") + labs(x="Educate", y="Income", title="Relationship between Education and Income")

Plotting a graph that shows the relationship between age and income

ggplot(turnout, aes(x=age, y=income)) + geom_point(color="orange") + geom_smooth(aes(x = age, y = income)) + labs(x="Age", y="Income", title="Relationship between Age and Income")

Plotting a scatterplot matrix showing the pairwise relationships among all three variables

# Keep columns 2 to 4: age, educate and income
turnout1 <- turnout[, 2:4]
plot(turnout1, pch=16, col="grey", main="Matrix Scatterplot of Income, Education and Age")

# Average income at each education level, with bars coloured by average age
turnout %>%
  group_by(educate) %>%
  summarize(av.income = mean(income), av.age = mean(age)) %>%
  ggplot() +
  geom_col(aes(x = educate, y = av.income, fill = av.age)) +
  theme_calc()

The summary of turnout data

summary(turnout)
##      race           age          educate          income      
##  others: 292   Min.   :17.0   Min.   : 0.00   Min.   : 0.000  
##  white :1708   1st Qu.:31.0   1st Qu.:10.00   1st Qu.: 1.744  
##                Median :42.0   Median :12.00   Median : 3.351  
##                Mean   :45.3   Mean   :12.07   Mean   : 3.887  
##                3rd Qu.:59.0   3rd Qu.:14.00   3rd Qu.: 5.233  
##                Max.   :95.0   Max.   :19.00   Max.   :14.925  
##       vote      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :1.000  
##  Mean   :0.746  
##  3rd Qu.:1.000  
##  Max.   :1.000
lm(turnout$income ~ turnout$educate)
## 
## Call:
## lm(formula = turnout$income ~ turnout$educate)
## 
## Coefficients:
##     (Intercept)  turnout$educate  
##         -0.6521           0.3761
lm(turnout$income ~ turnout$age)
## 
## Call:
## lm(formula = turnout$income ~ turnout$age)
## 
## Coefficients:
## (Intercept)  turnout$age  
##     5.03117     -0.02527

Implementing the log-likelihood function, with age and education as independent variables that explain the standard deviation of income

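ols.lf2 <- function(param) {
  mu <- param[1]                               # constant mean of income
  theta <- param[-1]                           # coefficients of the sd equation
  y <- as.vector(turnout$income)
  x <- cbind(1, turnout$educate, turnout$age)  # intercept, education, age
  sigma <- x %*% theta                         # sd of income as a linear function of x
  sum(dnorm(y, mu, sigma, log = TRUE))         # normal log-likelihood
}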
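The following is a minimal sketch of the same kind of log-likelihood, written so that it can be tried without the turnout data. It is not the author's function: the name `loglik_sd`, the simulated variables, and the "true" parameter values are assumptions made for illustration. The sketch also adds a guard that returns `-Inf` when any implied standard deviation is not positive, since `dnorm()` is only meaningful for a positive `sd`.

```r
set.seed(1)

# Simulated stand-ins for education, age and income (hypothetical values)
n <- 500
educ <- runif(n, 0, 19)
age <- runif(n, 17, 95)
x <- cbind(1, educ, age)
true_theta <- c(0.4, 0.13, 0.02)
y <- rnorm(n, mean = 3.5, sd = as.vector(x %*% true_theta))

# Normal log-likelihood with a constant mean and a linear model for the sd
loglik_sd <- function(param, y, x) {
  mu <- param[1]
  theta <- param[-1]
  sigma <- as.vector(x %*% theta)
  if (any(sigma <= 0)) return(-Inf)  # invalid region of the parameter space
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

ll_true <- loglik_sd(c(3.5, true_theta), y, x)  # parameters used to simulate
ll_flat <- loglik_sd(c(3.5, 50, 0, 0), y, x)    # a deliberately poor guess
ll_bad  <- loglik_sd(c(3.5, -1, 0, 0), y, x)    # negative sd, rejected

c(ll_true = ll_true, ll_flat = ll_flat, ll_bad = ll_bad)
```

Passing `y` and `x` as arguments instead of reading them from the global environment makes the function easier to reuse on other data sets.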

The Maximum Likelihood Estimation Result

?maxLik
# Opens the help page for maxLik(). Check the "method" options there to choose the most robust optimisation method.

library(maxLik)
mle_ols2 <- maxLik(logLik = ols.lf2, start = c(mu = 1, theta1 = 1, theta2 = 1, theta3= 1), method="BFGS")
summary(mle_ols2)
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 150 iterations
## Return code 0: successful convergence 
## Log-Likelihood: -4843.15 
## 4  free parameters
## Estimates:
##        Estimate Std. error t value  Pr(> t)    
## mu     3.555011   0.069193  51.378  < 2e-16 ***
## theta1 0.362114   0.204550   1.770   0.0767 .  
## theta2 0.133349   0.010756  12.398  < 2e-16 ***
## theta3 0.017507   0.002852   6.139 8.32e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
confint(mle_ols2)
##              2.5 %     97.5 %
## mu      3.41939485 3.69062797
## theta1 -0.03879737 0.76302499
## theta2  0.11226813 0.15442979
## theta3  0.01191727 0.02309622
# confint() computes confidence intervals for the estimated parameters. Here it shows two-sided 95% intervals: the lower and upper limits are the 2.5% and 97.5% points of the relevant distribution.
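As a rough check, the limits above can be recomputed by hand as normal-approximation (Wald) intervals, estimate ± z × standard error. That confint() uses exactly this rule here is an assumption, but the sketch below, which uses only the rounded estimates and standard errors copied from the summary output, should reproduce the reported limits up to rounding.

```r
# Estimates and standard errors copied from the summary() output
est <- c(mu = 3.555011, theta1 = 0.362114, theta2 = 0.133349, theta3 = 0.017507)
se  <- c(mu = 0.069193, theta1 = 0.204550, theta2 = 0.010756, theta3 = 0.002852)

z <- qnorm(0.975)  # about 1.96 for a two-sided 95% interval

# Lower and upper limits, one row per parameter
wald_ci <- cbind(lower = est - z * se, upper = est + z * se)
print(round(wald_ci, 4))
```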

Plotting the estimated coefficients with their confidence intervals

library(coefplot)
coefplot(mle_ols2)

Interpreting the results

The output shows the maximum likelihood estimates for a model of the three variables from the turnout data. It presents the effect of age and education on the standard deviation of income:

mu = average income for population

theta1 = the intercept

theta2 = impact of education on sd of income

theta3 = impact of age on sd of income

The estimates show that both age and education have a positive effect on the standard deviation of income. This concerns the spread of income, not its level: the simple regression of income on age above has a negative slope. For each additional unit of age the standard deviation of income increases by about 0.018, and for each additional unit of education it increases by about 0.13. Per unit, education therefore has a stronger effect on the dispersion of income than age (0.13 > 0.018): as education increases, the differences among people’s incomes also increase.
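To make the size of these effects concrete, the sketch below plugs the rounded estimates into the standard deviation equation, sigma = theta1 + theta2 × educate + theta3 × age. The helper name `implied_sd` is hypothetical. The values educate = 12 and age = 42 are the medians from the summary, and educate = 16 is an arbitrary comparison point.

```r
# Rounded estimates copied from the summary() output
theta <- c(intercept = 0.362114, educate = 0.133349, age = 0.017507)

# Hypothetical helper: standard deviation of income implied by the model
implied_sd <- function(educate, age) {
  unname(theta["intercept"] + theta["educate"] * educate + theta["age"] * age)
}

sd_12 <- implied_sd(educate = 12, age = 42)  # median education and age
sd_16 <- implied_sd(educate = 16, age = 42)  # four more units of education

c(sd_12 = sd_12, sd_16 = sd_16, difference = sd_16 - sd_12)
```

The difference equals 4 × theta2, because the model is linear in education.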

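The output also shows that the estimates of mu, theta2 and theta3 are statistically significant, with p-values smaller than 0.001. The intercept theta1 is not significant at the 5% level (p = 0.0767), and its 95% confidence interval includes zero.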
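The test statistics and p-values in the summary can be approximated from the rounded estimates and standard errors. The sketch assumes two-sided p-values from the standard normal distribution, which agrees with the reported value for theta1.

```r
# Estimates and standard errors copied from the summary() output
est <- c(mu = 3.555011, theta1 = 0.362114, theta2 = 0.133349, theta3 = 0.017507)
se  <- c(mu = 0.069193, theta1 = 0.204550, theta2 = 0.010756, theta3 = 0.002852)

z_stat <- est / se                  # the "t value" column
p_val  <- 2 * pnorm(-abs(z_stat))   # two-sided p-value, normal approximation

print(round(cbind(z_stat, p_val), 4))
```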

The End